Local Policy Recommendations for Crime Reduction

Introduction

In this report, we seek to examine and discuss determinants of crime and offer recommend actionable policy recommendations for local politicians running for election at the county level. For our analysis, we draw on sample data collected from a study by Cornwell and Trumball, researchers from the University of Georgia and West Virginia University. Our sample data includes data on crime rates, arrests, sentences, demographics, local weekly wages, tax revenues and more drawn from local and federal government data sources. Although the age of the data may be a potential limitation of our study, we believe the insights we gather and policy recommendations remain appropriate for local campaigns today.

Our primary question that will drive our data exploration are to ask which variables affect crime rate the most.

Exploratory Data Analysis (EDA)

Variables

The crime_v2 dataset provided includes 25 variables of interest.

We include them below for reference by category of interest.

Category Variable
Crime Rate crmrte
Geographic county, west, central
Demographic urban, density, pctmin80, pctymle
Economic - Wage wcon, wtuc, wtrd, wfir, wser, wmfg, wfed, wsta, wloc
Economic - Revenue taxpc
Law Enforcment polpc, prbarr, prbconv, mix
Judicial/Sentencing prbpris, avgsen
Time Period year

The variables above operationalize the conditions we wish to explore and their affects on crime rate

Chiefly, these break down as follows.

  • The Economic variables measures the county’s economic activity and health (e.g. opportunity to pursue legal forms of income). These variables come in the form of available wages and tax revenue returned to the county.

  • The Law enforcment variables measures the county’s ability to utilize law enforcment policy to deter crime. Similarly, the Judicial variables also signify impact of deterence to crime.

  • The Demographic variables measure the cultural variability that represent the social differences between each county, such as urban vs rural and minority populations.

  • The Geographic elements are categorical. They represent they ways in which the population is segmented by geography.

Data Prep and Exploratory Analysis

We begin our analysis by loading the data set and performing basic checks and inspections.

dfCrime = read.csv("crime_v2.csv")
str(dfCrime)
'data.frame':   97 obs. of  25 variables:
 $ county  : int  1 3 5 7 9 11 13 15 17 19 ...
 $ year    : int  87 87 87 87 87 87 87 87 87 87 ...
 $ crmrte  : num  0.0356 0.0153 0.013 0.0268 0.0106 ...
 $ prbarr  : num  0.298 0.132 0.444 0.365 0.518 ...
 $ prbconv : Factor w/ 92 levels "","`","0.068376102",..: 63 89 13 62 52 3 59 78 42 86 ...
 $ prbpris : num  0.436 0.45 0.6 0.435 0.443 ...
 $ avgsen  : num  6.71 6.35 6.76 7.14 8.22 ...
 $ polpc   : num  0.001828 0.000746 0.001234 0.00153 0.00086 ...
 $ density : num  2.423 1.046 0.413 0.492 0.547 ...
 $ taxpc   : num  31 26.9 34.8 42.9 28.1 ...
 $ west    : int  0 0 1 0 1 1 0 0 0 0 ...
 $ central : int  1 1 0 1 0 0 0 0 0 0 ...
 $ urban   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pctmin80: num  20.22 7.92 3.16 47.92 1.8 ...
 $ wcon    : num  281 255 227 375 292 ...
 $ wtuc    : num  409 376 372 398 377 ...
 $ wtrd    : num  221 196 229 191 207 ...
 $ wfir    : num  453 259 306 281 289 ...
 $ wser    : num  274 192 210 257 215 ...
 $ wmfg    : num  335 300 238 282 291 ...
 $ wfed    : num  478 410 359 412 377 ...
 $ wsta    : num  292 363 332 328 367 ...
 $ wloc    : num  312 301 281 299 343 ...
 $ mix     : num  0.0802 0.0302 0.4651 0.2736 0.0601 ...
 $ pctymle : num  0.0779 0.0826 0.0721 0.0735 0.0707 ...
head(dfCrime)
  county year    crmrte   prbarr     prbconv  prbpris avgsen      polpc
1      1   87 0.0356036 0.298270 0.527595997 0.436170   6.71 0.00182786
2      3   87 0.0152532 0.132029 1.481480002 0.450000   6.35 0.00074588
3      5   87 0.0129603 0.444444 0.267856985 0.600000   6.76 0.00123431
4      7   87 0.0267532 0.364760 0.525424004 0.435484   7.14 0.00152994
5      9   87 0.0106232 0.518219 0.476563007 0.442623   8.22 0.00086018
6     11   87 0.0146067 0.524664 0.068376102 0.500000  13.00 0.00288203
    density    taxpc west central urban pctmin80     wcon     wtuc
1 2.4226327 30.99368    0       1     0 20.21870 281.4259 408.7245
2 1.0463320 26.89208    0       1     0  7.91632 255.1020 376.2542
3 0.4127659 34.81605    1       0     0  3.16053 226.9470 372.2084
4 0.4915572 42.94759    0       1     0 47.91610 375.2345 397.6901
5 0.5469484 28.05474    1       0     0  1.79619 292.3077 377.3126
6 0.6113361 35.22974    1       0     0  1.54070 250.4006 401.3378
      wtrd     wfir     wser   wmfg   wfed   wsta   wloc        mix
1 221.2701 453.1722 274.1775 334.54 477.58 292.09 311.91 0.08016878
2 196.0101 258.5650 192.3077 300.38 409.83 362.96 301.47 0.03022670
3 229.3209 305.9441 209.6972 237.65 358.98 331.53 281.37 0.46511629
4 191.1720 281.0651 256.7214 281.80 412.15 328.27 299.03 0.27362204
5 206.8215 289.3125 215.1933 290.89 377.35 367.23 342.82 0.06008584
6 187.8255 258.5650 237.1507 258.60 391.48 325.71 275.22 0.31952664
     pctymle
1 0.07787097
2 0.08260694
3 0.07211538
4 0.07353726
5 0.07069755
6 0.09891920
tail(dfCrime)
   county year crmrte prbarr prbconv prbpris avgsen polpc density taxpc
92     NA   NA     NA     NA              NA     NA    NA      NA    NA
93     NA   NA     NA     NA              NA     NA    NA      NA    NA
94     NA   NA     NA     NA              NA     NA    NA      NA    NA
95     NA   NA     NA     NA              NA     NA    NA      NA    NA
96     NA   NA     NA     NA              NA     NA    NA      NA    NA
97     NA   NA     NA     NA       `      NA     NA    NA      NA    NA
   west central urban pctmin80 wcon wtuc wtrd wfir wser wmfg wfed wsta
92   NA      NA    NA       NA   NA   NA   NA   NA   NA   NA   NA   NA
93   NA      NA    NA       NA   NA   NA   NA   NA   NA   NA   NA   NA
94   NA      NA    NA       NA   NA   NA   NA   NA   NA   NA   NA   NA
95   NA      NA    NA       NA   NA   NA   NA   NA   NA   NA   NA   NA
96   NA      NA    NA       NA   NA   NA   NA   NA   NA   NA   NA   NA
97   NA      NA    NA       NA   NA   NA   NA   NA   NA   NA   NA   NA
   wloc mix pctymle
92   NA  NA      NA
93   NA  NA      NA
94   NA  NA      NA
95   NA  NA      NA
96   NA  NA      NA
97   NA  NA      NA
summary(dfCrime)
     county           year        crmrte             prbarr       
 Min.   :  1.0   Min.   :87   Min.   :0.005533   Min.   :0.09277  
 1st Qu.: 52.0   1st Qu.:87   1st Qu.:0.020927   1st Qu.:0.20568  
 Median :105.0   Median :87   Median :0.029986   Median :0.27095  
 Mean   :101.6   Mean   :87   Mean   :0.033400   Mean   :0.29492  
 3rd Qu.:152.0   3rd Qu.:87   3rd Qu.:0.039642   3rd Qu.:0.34438  
 Max.   :197.0   Max.   :87   Max.   :0.098966   Max.   :1.09091  
 NA's   :6       NA's   :6    NA's   :6          NA's   :6        
        prbconv      prbpris           avgsen           polpc         
            : 5   Min.   :0.1500   Min.   : 5.380   Min.   :0.000746  
 0.588859022: 2   1st Qu.:0.3648   1st Qu.: 7.340   1st Qu.:0.001231  
 `          : 1   Median :0.4234   Median : 9.100   Median :0.001485  
 0.068376102: 1   Mean   :0.4108   Mean   : 9.647   Mean   :0.001702  
 0.140350997: 1   3rd Qu.:0.4568   3rd Qu.:11.420   3rd Qu.:0.001877  
 0.154451996: 1   Max.   :0.6000   Max.   :20.700   Max.   :0.009054  
 (Other)    :86   NA's   :6        NA's   :6        NA's   :6         
    density            taxpc             west           central      
 Min.   :0.00002   Min.   : 25.69   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.54741   1st Qu.: 30.66   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.96226   Median : 34.87   Median :0.0000   Median :0.0000  
 Mean   :1.42884   Mean   : 38.06   Mean   :0.2527   Mean   :0.3736  
 3rd Qu.:1.56824   3rd Qu.: 40.95   3rd Qu.:0.5000   3rd Qu.:1.0000  
 Max.   :8.82765   Max.   :119.76   Max.   :1.0000   Max.   :1.0000  
 NA's   :6         NA's   :6        NA's   :6        NA's   :6       
     urban            pctmin80           wcon            wtuc      
 Min.   :0.00000   Min.   : 1.284   Min.   :193.6   Min.   :187.6  
 1st Qu.:0.00000   1st Qu.: 9.845   1st Qu.:250.8   1st Qu.:374.6  
 Median :0.00000   Median :24.312   Median :281.4   Median :406.5  
 Mean   :0.08791   Mean   :25.495   Mean   :285.4   Mean   :411.7  
 3rd Qu.:0.00000   3rd Qu.:38.142   3rd Qu.:314.8   3rd Qu.:443.4  
 Max.   :1.00000   Max.   :64.348   Max.   :436.8   Max.   :613.2  
 NA's   :6         NA's   :6        NA's   :6       NA's   :6      
      wtrd            wfir            wser             wmfg      
 Min.   :154.2   Min.   :170.9   Min.   : 133.0   Min.   :157.4  
 1st Qu.:190.9   1st Qu.:286.5   1st Qu.: 229.7   1st Qu.:288.9  
 Median :203.0   Median :317.3   Median : 253.2   Median :320.2  
 Mean   :211.6   Mean   :322.1   Mean   : 275.6   Mean   :335.6  
 3rd Qu.:225.1   3rd Qu.:345.4   3rd Qu.: 280.5   3rd Qu.:359.6  
 Max.   :354.7   Max.   :509.5   Max.   :2177.1   Max.   :646.9  
 NA's   :6       NA's   :6       NA's   :6        NA's   :6      
      wfed            wsta            wloc            mix         
 Min.   :326.1   Min.   :258.3   Min.   :239.2   Min.   :0.01961  
 1st Qu.:400.2   1st Qu.:329.3   1st Qu.:297.3   1st Qu.:0.08074  
 Median :449.8   Median :357.7   Median :308.1   Median :0.10186  
 Mean   :442.9   Mean   :357.5   Mean   :312.7   Mean   :0.12884  
 3rd Qu.:478.0   3rd Qu.:382.6   3rd Qu.:329.2   3rd Qu.:0.15175  
 Max.   :598.0   Max.   :499.6   Max.   :388.1   Max.   :0.46512  
 NA's   :6       NA's   :6       NA's   :6       NA's   :6        
    pctymle       
 Min.   :0.06216  
 1st Qu.:0.07443  
 Median :0.07771  
 Mean   :0.08396  
 3rd Qu.:0.08350  
 Max.   :0.24871  
 NA's   :6        

First, we note there are missing rows in the dataset that were imported. We’ll remove those rows now.

nrow(dfCrime)
[1] 97
dfCrime <-na.omit(dfCrime) # omit the NA rows
nrow(dfCrime)
[1] 91

Next, we will inspect the data to see if there are duplicate records

dfCrime[duplicated(dfCrime),]
   county year    crmrte   prbarr     prbconv  prbpris avgsen      polpc
89    193   87 0.0235277 0.266055 0.588859022 0.423423   5.86 0.00117887
     density    taxpc west central urban pctmin80     wcon     wtuc
89 0.8138298 28.51783    1       0     0  5.93109 285.8289 480.1948
       wtrd     wfir     wser   wmfg   wfed   wsta   wloc       mix
89 268.3836 365.0196 295.9352 295.63 468.26 337.88 348.74 0.1105016
      pctymle
89 0.07819394

A duplicate row exists. We’ll remove it.

dfCrime <- dfCrime[!duplicated(dfCrime),] # remove the duplicated row
nrow(dfCrime)
[1] 90

We also saw that pbconv was coded as a level. It is not a level but a ratio. We’ll change that now.

dfCrime$prbconv<-as.numeric(levels(dfCrime$prbconv))[dfCrime$prbconv]

We also notice by comparision of pctymle and pctmin80 one of the variables is off by a factor of 100. We will divide pctmin80 by 100 so the two variables are in the same unit terms.

dfCrime$pctmin80<-dfCrime$pctmin80/100

County was expressed as a number. However, it is a categorical variable and we will convert it to a factor instead.

dfCrime$county<-as.factor(dfCrime$county)

Next we inspect the indicator variables to see if they were coded correctly.

dfCrime %>% group_by(west, central) %>% tally()
# A tibble: 4 x 3
# Groups:   west [2]
   west central     n
  <int>   <int> <int>
1     0       0    35
2     0       1    33
3     1       0    21
4     1       1     1
dfCrime %>%
filter(west ==1 & central ==1)
  county year    crmrte   prbarr prbconv  prbpris avgsen      polpc
1     71   87 0.0544061 0.243119 0.22959 0.379175  11.29 0.00207028
   density    taxpc west central urban pctmin80     wcon     wtuc     wtrd
1 4.834734 31.53658    1       1     0  0.13315 291.4508 595.3719 240.3673
      wfir     wser   wmfg   wfed   wsta   wloc       mix    pctymle
1 348.0254 295.2301 358.95 509.43 359.11 339.58 0.1018608 0.07939028

One county was either mis-coded, or it truly belongs to both regions. However, this is very unlikely as the intended technique is to widen the data and introduce indicator variables for each category. It is not likley the data was captured for both categories.

We will need further analysis on this datapoint as it relates to the rest of the data.

For now, we will encode a new region variable and place the datapoint in its own category.

#Map central and west to a region code, and create a new category for other
# Note that county 71 has both western and central codes
dfCrime$region <- case_when (
            (dfCrime$central ==0 & dfCrime$west ==0) ~ 0, #Eastern, Coastal, Other
            (dfCrime$central ==0 & dfCrime$west ==1) ~ 1, #Western
            (dfCrime$central ==1 & dfCrime$west ==0) ~ 2, #Central
            (dfCrime$central ==1 & dfCrime$west ==1) ~ 3 #Central-Western border county - this could be Charlotte or it could be a coding error.
        )
dfCrime$regcode =
            factor( dfCrime$region , levels = 0:3 , labels =
                    c( 'O',
                       'W',
                       'C',
                       'CW')
                   )

We will also introduce an indicator variable for counties located in the “other” region that are not west or central

dfCrime$other <- ifelse((dfCrime$central ==0 & dfCrime$west ==0), 1, 0)

And we’ll add an indicator variable to serve as complement to the urban indicator variable and call this ‘nonurban’

dfCrime$nonurban <- ifelse((dfCrime$urban==0), 1, 0)

By way of the 1980 Census fact sheet, we discover the urban field is an encoding for SMSA (Standard Metropolitan Statistical Areas). https://www2.census.gov/prod2/decennial/documents/1980/1980censusofpopu8011uns_bw.pdf The value is one if the county is inside a metropolitan area. Otherwise, if the county is outisde a metropolitan area, the value is zero.

We create a metro factor variable to better describe this feature.

# create factor for SMSA (standard metropolitan statistical areas) with two levels - inside or outside
#    https://www2.census.gov/prod2/decennial/documents/1980/1980censusofpopu8011uns_bw.pdf
dfCrime$metro =
            factor( dfCrime$urban , levels = 0:1 , labels =
                    c( 'Outside',
                       'Inside'
                      )
                   )

Next we will visualize each variable and its relationship to the variable crmrte through scatter plots

#Plot of the economic and tax related variables vs crmrte
q1<-ggplot(data = dfCrime, aes(x = wcon, y = crmrte, color = region)) + 
      geom_point()+
  geom_smooth(method = "lm")
q2<-ggplot(data = dfCrime, aes(x = wtuc, y = crmrte, color = region)) + 
      geom_point()+
  geom_smooth(method = "lm")
q3<-ggplot(data = dfCrime, aes(x = wtrd, y = crmrte, color = region)) + 
      geom_point()+
  geom_smooth(method = "lm")
q4<-ggplot(data = dfCrime, aes(x = wfir, y = crmrte, color = region)) + 
      geom_point()+
  geom_smooth(method = "lm")
q5<-ggplot(data = dfCrime, aes(x = wser, y = crmrte, color = region)) + 
      geom_point()+
  geom_smooth(method = "lm")
q6<-ggplot(data = dfCrime, aes(x = wmfg, y = crmrte, color = region)) + 
      geom_point()+
  geom_smooth(method = "lm")
q7<-ggplot(data = dfCrime, aes(x = wfed, y = crmrte, color = region)) + 
      geom_point()+
  geom_smooth(method = "lm")
q8<-ggplot(data = dfCrime, aes(x = wsta, y = crmrte, color = region)) + 
      geom_point()+
  geom_smooth(method = "lm")
q9<-ggplot(data = dfCrime, aes(x = wloc, y = crmrte, color = region)) + 
      geom_point()+
  geom_smooth(method = "lm")
q10<-ggplot(data = dfCrime, aes(x = taxpc, y = crmrte, color = region)) + 
      geom_point()+
  geom_smooth(method = "lm")
grid.arrange(q1, q2, q3, q4, q5, q6, q7, q8, q9, q10, ncol=2)

We observe a few data points of interest in the comparison above, notably, wser and taxpc appear to have extreme data points

Other variables show outliers as well, but not as extreme. We will see if any of these points have leverage or influence if chosen for models.

For now, lets dig further into the extreme outliers from our immediate visual inspection.

dfCrime %>%
filter(wser > 2000) %>%
select(county, wser)
  county     wser
1    185 2177.068

This average service wage is much too high based on what we know about the 1980s and every other wage recorded in comparison. A review of the detailed population statistics describing mean wage per industry (table 231) confirms this. https://www2.census.gov/prod2/decennial/documents/1980/1980censusofpopu801352uns_bw.pdf

We will adjust this wage by replacing it with an imputed value from the sample population. To impute this value we will rely on the package Hmisc to derive it for us.

dfCrime$wser[which(dfCrime$county==185)]<-NA # set the value to NA so it will be imputed
impute_arg <- aregImpute(~ crmrte +  urban + central + west + other +
                         prbarr + prbconv + prbpris + avgsen + polpc + 
                         density + taxpc + pctmin80 + wcon + wtuc +
                         wtrd + wfir + wser + wmfg + wfed + wsta + wloc +
                         mix + pctymle, data = dfCrime, n.impute = 30)
Iteration 1 
Iteration 2 
Iteration 3 
Iteration 4 
Iteration 5 
Iteration 6 
Iteration 7 
Iteration 8 
Iteration 9 
Iteration 10 
Iteration 11 
Iteration 12 
Iteration 13 
Iteration 14 
Iteration 15 
Iteration 16 
Iteration 17 
Iteration 18 
Iteration 19 
Iteration 20 
Iteration 21 
Iteration 22 
Iteration 23 
Iteration 24 
Iteration 25 
Iteration 26 
Iteration 27 
Iteration 28 
Iteration 29 
Iteration 30 
impute_arg

Multiple Imputation using Bootstrap and PMM

aregImpute(formula = ~crmrte + urban + central + west + other + 
    prbarr + prbconv + prbpris + avgsen + polpc + density + taxpc + 
    pctmin80 + wcon + wtuc + wtrd + wfir + wser + wmfg + wfed + 
    wsta + wloc + mix + pctymle, data = dfCrime, n.impute = 30)

n: 90   p: 24   Imputations: 30     nk: 3 

Number of NAs:
  crmrte    urban  central     west    other   prbarr  prbconv  prbpris 
       0        0        0        0        0        0        0        0 
  avgsen    polpc  density    taxpc pctmin80     wcon     wtuc     wtrd 
       0        0        0        0        0        0        0        0 
    wfir     wser     wmfg     wfed     wsta     wloc      mix  pctymle 
       0        1        0        0        0        0        0        0 

         type d.f.
crmrte      s    2
urban       l    1
central     l    1
west        l    1
other       l    1
prbarr      s    2
prbconv     s    2
prbpris     s    2
avgsen      s    2
polpc       s    2
density     s    2
taxpc       s    2
pctmin80    s    2
wcon        s    2
wtuc        s    2
wtrd        s    2
wfir        s    2
wser        s    1
wmfg        s    2
wfed        s    2
wsta        s    2
wloc        s    2
mix         s    2
pctymle     s    2

Transformation of Target Variables Forced to be Linear

R-squares for Predicting Non-Missing Values for Each Variable
Using Last Imputations of Predictors
 wser 
0.959 
df<-melt(t(impute_arg$imputed$wser), value.name=c("wser"))
options(repr.plot.width=4, repr.plot.height=2)
ggplot(data = df, aes(x = wser)) + geom_histogram(bins=30)

options(repr.plot.width=NULL, repr.plot.height=NULL)

We will reassign the value in our dataset to the median from these trials.

dfCrime$wser[which(dfCrime$county==185)]<-median(impute_arg$imputed$wser)
dfCrime$wser[which(dfCrime$county==185)]
[1] 266.7583

Next, let’s examine the tax per capita outlier

dfCrime %>%
filter(taxpc > 100)
  county year    crmrte   prbarr  prbconv  prbpris avgsen      polpc
1     55   87 0.0790163 0.224628 0.207831 0.304348  13.57 0.00400962
    density    taxpc west central urban  pctmin80     wcon     wtuc
1 0.5115089 119.7615    0       0     0 0.0649622 309.5238 445.2762
      wtrd     wfir     wser   wmfg   wfed   wsta   wloc        mix
1 189.7436 284.5933 221.3903 319.21 338.91 361.68 326.08 0.08437271
     pctymle region regcode other nonurban   metro
1 0.07613807      0       O     1        1 Outside

The tax revenue per capita in this county is excessive. There is nothing in the wage variables that would indicate more tax revenues should be captured than what is normal. We will adjust this taxpc data point by replacing it by imputing its value from the sample.

dfCrime$taxpc[which(dfCrime$county==55)]<- NA
impute_arg <- aregImpute(~ crmrte +  urban + central + west + other +
                         prbarr + prbconv + prbpris + avgsen + polpc + 
                         density + taxpc + pctmin80 + wcon + wtuc +
                         wtrd + wfir + wser + wmfg + wfed + wsta + wloc +
                         mix + pctymle, data = dfCrime, n.impute = 30)
Iteration 1 
Iteration 2 
Iteration 3 
Iteration 4 
Iteration 5 
Iteration 6 
Iteration 7 
Iteration 8 
Iteration 9 
Iteration 10 
Iteration 11 
Iteration 12 
Iteration 13 
Iteration 14 
Iteration 15 
Iteration 16 
Iteration 17 
Iteration 18 
Iteration 19 
Iteration 20 
Iteration 21 
Iteration 22 
Iteration 23 
Iteration 24 
Iteration 25 
Iteration 26 
Iteration 27 
Iteration 28 
Iteration 29 
Iteration 30 
impute_arg

Multiple Imputation using Bootstrap and PMM

aregImpute(formula = ~crmrte + urban + central + west + other + 
    prbarr + prbconv + prbpris + avgsen + polpc + density + taxpc + 
    pctmin80 + wcon + wtuc + wtrd + wfir + wser + wmfg + wfed + 
    wsta + wloc + mix + pctymle, data = dfCrime, n.impute = 30)

n: 90   p: 24   Imputations: 30     nk: 3 

Number of NAs:
  crmrte    urban  central     west    other   prbarr  prbconv  prbpris 
       0        0        0        0        0        0        0        0 
  avgsen    polpc  density    taxpc pctmin80     wcon     wtuc     wtrd 
       0        0        0        1        0        0        0        0 
    wfir     wser     wmfg     wfed     wsta     wloc      mix  pctymle 
       0        0        0        0        0        0        0        0 

         type d.f.
crmrte      s    2
urban       l    1
central     l    1
west        l    1
other       l    1
prbarr      s    2
prbconv     s    2
prbpris     s    2
avgsen      s    2
polpc       s    2
density     s    2
taxpc       s    1
pctmin80    s    2
wcon        s    2
wtuc        s    2
wtrd        s    2
wfir        s    2
wser        s    2
wmfg        s    2
wfed        s    2
wsta        s    2
wloc        s    2
mix         s    2
pctymle     s    2

Transformation of Target Variables Forced to be Linear

R-squares for Predicting Non-Missing Values for Each Variable
Using Last Imputations of Predictors
taxpc 
0.795 
df<-melt(t(impute_arg$imputed$taxpc), value.name=c("taxpc"))
options(repr.plot.width=4, repr.plot.height=2)
ggplot(data = df, aes(x = taxpc)) + geom_histogram(bins=30)

options(repr.plot.width=NULL, repr.plot.height=NULL)
dfCrime$taxpc[which(dfCrime$county==55)]<-median(impute_arg$imputed$taxpc)
dfCrime$taxpc[which(dfCrime$county==55)]
[1] 34.48254
#Plot of the criminal justice and law enforcment related variables vs crmrte
q1<-ggplot(data = dfCrime, aes(x = prbarr, y = crmrte, color = region)) + 
      geom_point()+
  geom_smooth(method = "lm")
q2<-ggplot(data = dfCrime, aes(x = prbconv, y = crmrte, color = region)) + 
      geom_point()+
  geom_smooth(method = "lm")
q3<-ggplot(data = dfCrime, aes(x = prbpris, y = crmrte, color = region)) + 
      geom_point()+
  geom_smooth(method = "lm")
q4<-ggplot(data = dfCrime, aes(x = avgsen, y = crmrte, color = region)) + 
      geom_point()+
  geom_smooth(method = "lm")
q5<-ggplot(data = dfCrime, aes(x = polpc, y = crmrte, color = region)) + 
      geom_point()+
  geom_smooth(method = "lm")
q6<-ggplot(data = dfCrime, aes(x = mix, y = crmrte, color = region)) + 
      geom_point()+
  geom_smooth(method = "lm")

grid.arrange(q1, q2, q3, q4, q5, q6, ncol=2)

The criminal justice and law enforcement variables also show evidence of possible outliers, notably, pbarr and polpc appear to have extreme data points

We also see that prbarr and prbconv have values greater than 1. However, these are not true probabity numbers and are instead ratios used as a stand in for the true probability numbers.

There is a possibility of higher arrests per incident for an area. Meaning, the area has low incidents in general but when there were incidents there were also multiple arrests. The same case can be made for the convictions per arrest variable which we see is for a different region. In that county there may have been multiple charges brought per one arrest.

#plot of demographic information for counties Outside and Inside the metro areas
# population density, percent minority, percent young male

q1<-ggplot(data = dfCrime, aes(x = density, y = crmrte, color = region)) + 
      geom_point() + facet_wrap(~ metro) +
  geom_smooth(method = "lm")
q2<-ggplot(data = dfCrime, aes(x = pctmin80, y = crmrte, color = region)) + 
      geom_point() + facet_wrap(~ metro) +
  geom_smooth(method = "lm")
q3<-ggplot(data = dfCrime, aes(x = pctymle, y = crmrte, color = region)) + 
      geom_point()+ facet_wrap(~ metro) +
  geom_smooth(method = "lm")

grid.arrange(q1, q2, q3, ncol=1)

Notably more outliers are observed in demographic information. Here, pctymle in one county outside of a metro area is nearly 25%. That seems quite high in normal statistical measures of the population, however, this can be explained as a county having a large college town population.

Finally, we can see our bright blue region 3 county and notice its population density. Its behavior is more similar to an inside metro area. Than outside. In addition to be coded for both western and central regions, it appears to be miscoded here as well.

We will address the metro variable, and examine whether the region variable should be west, central or other instead of both central and west

dfCrime %>%
filter(west ==1 & central ==1) %>%
select(county, west, central, other, urban, region, regcode, metro)
  county west central other urban region regcode   metro
1     71    1       1     0     0      3      CW Outside
dfCrime$west[which(dfCrime$county==71)]<-NA
dfCrime$central[which(dfCrime$county==71)]<-NA
dfCrime$other[which(dfCrime$county==71)]<-NA
dfCrime$urban[which(dfCrime$county==71)]<-NA
impute_arg <- aregImpute(~ crmrte +  urban + central + west +
                         prbarr + prbconv + prbpris + avgsen + polpc + 
                         density + taxpc + pctmin80 + wcon + wtuc +
                         wtrd + wfir + wser + wmfg + wfed + wsta + wloc +
                         mix + pctymle, data = dfCrime, match="closest", n.impute = 100)
Iteration 1 
Iteration 2 
Iteration 3 
Iteration 4 
Iteration 5 
Iteration 6 
Iteration 7 
Iteration 8 
Iteration 9 
Iteration 10 
Iteration 11 
Iteration 12 
Iteration 13 
Iteration 14 
Iteration 15 
Iteration 16 
Iteration 17 
Iteration 18 
Iteration 19 
Iteration 20 
Iteration 21 
Iteration 22 
Iteration 23 
Iteration 24 
Iteration 25 
Iteration 26 
Iteration 27 
Iteration 28 
Iteration 29 
Iteration 30 
Iteration 31 
Iteration 32 
Iteration 33 
Iteration 34 
Iteration 35 
Iteration 36 
Iteration 37 
Iteration 38 
Iteration 39 
Iteration 40 
Iteration 41 
Iteration 42 
Iteration 43 
Iteration 44 
Iteration 45 
Iteration 46 
Iteration 47 
Iteration 48 
Iteration 49 
Iteration 50 
Iteration 51 
Iteration 52 
Iteration 53 
Iteration 54 
Iteration 55 
Iteration 56 
Iteration 57 
Iteration 58 
Iteration 59 
Iteration 60 
Iteration 61 
Iteration 62 
Iteration 63 
Iteration 64 
Iteration 65 
Iteration 66 
Iteration 67 
Iteration 68 
Iteration 69 
Iteration 70 
Iteration 71 
Iteration 72 
Iteration 73 
Iteration 74 
Iteration 75 
Iteration 76 
Iteration 77 
Iteration 78 
Iteration 79 
Iteration 80 
Iteration 81 
Iteration 82 
Iteration 83 
Iteration 84 
Iteration 85 
Iteration 86 
Iteration 87 
Iteration 88 
Iteration 89 
Iteration 90 
Iteration 91 
Iteration 92 
Iteration 93 
Iteration 94 
Iteration 95 
Iteration 96 
Iteration 97 
Iteration 98 
Iteration 99 
Iteration 100 
Iteration 101 
Iteration 102 
Iteration 103 
impute_arg

Multiple Imputation using Bootstrap and PMM

aregImpute(formula = ~crmrte + urban + central + west + prbarr + 
    prbconv + prbpris + avgsen + polpc + density + taxpc + pctmin80 + 
    wcon + wtuc + wtrd + wfir + wser + wmfg + wfed + wsta + wloc + 
    mix + pctymle, data = dfCrime, n.impute = 100, match = "closest")

n: 90   p: 23   Imputations: 100    nk: 3 

Number of NAs:
  crmrte    urban  central     west   prbarr  prbconv  prbpris   avgsen 
       0        1        1        1        0        0        0        0 
   polpc  density    taxpc pctmin80     wcon     wtuc     wtrd     wfir 
       0        0        0        0        0        0        0        0 
    wser     wmfg     wfed     wsta     wloc      mix  pctymle 
       0        0        0        0        0        0        0 

         type d.f.
crmrte      s    2
urban       l    1
central     l    1
west        l    1
prbarr      s    2
prbconv     s    2
prbpris     s    2
avgsen      s    2
polpc       s    2
density     s    2
taxpc       s    2
pctmin80    s    2
wcon        s    2
wtuc        s    2
wtrd        s    2
wfir        s    2
wser        s    2
wmfg        s    2
wfed        s    2
wsta        s    2
wloc        s    2
mix         s    2
pctymle     s    2

Transformation of Target Variables Forced to be Linear

R-squares for Predicting Non-Missing Values for Each Variable
Using Last Imputations of Predictors
  urban central    west 
  0.961   0.900   0.951 
df<-melt(t(impute_arg$imputed$central), value.name=c("central"))
options(repr.plot.width=4, repr.plot.height=2)
ggplot(data = df, aes(x = central)) + geom_histogram(bins=30)

options(repr.plot.width=NULL, repr.plot.height=NULL)
df<-melt(t(impute_arg$imputed$west), value.name=c("west"))
options(repr.plot.width=4, repr.plot.height=2)
ggplot(data = df, aes(x = west)) + geom_histogram(bins=30)

options(repr.plot.width=NULL, repr.plot.height=NULL)
df<-melt(t(impute_arg$imputed$urban), value.name=c("urban"))
options(repr.plot.width=4, repr.plot.height=2)
ggplot(data = df, aes(x = urban)) + geom_histogram(bins=30)

options(repr.plot.width=NULL, repr.plot.height=NULL)

The results confirm the county is urban. It is also highly probable that county 71 is not west and is most associated with central. After correcting our data for urban and west, let’s compare ‘central’ with ‘other’ to be certain we have the right region.

dfCrime$urban[which(dfCrime$county==71)]<-median(impute_arg$imputed$urban)
dfCrime$nonurban[which(dfCrime$county==71)]<-(1-median(impute_arg$imputed$urban))
dfCrime$west[which(dfCrime$county==71)]<-median(impute_arg$imputed$west)
impute_arg <- aregImpute(~ crmrte +  urban + central + other +
                         prbarr + prbconv + prbpris + avgsen + polpc + 
                         density + taxpc + pctmin80 + wcon + wtuc +
                         wtrd + wfir + wser + wmfg + wfed + wsta + wloc +
                         mix + pctymle, data = dfCrime, match="closest", n.impute = 100)
Iteration 1 
Iteration 2 
Iteration 3 
Iteration 4 
Iteration 5 
Iteration 6 
Iteration 7 
Iteration 8 
Iteration 9 
Iteration 10 
Iteration 11 
Iteration 12 
Iteration 13 
Iteration 14 
Iteration 15 
Iteration 16 
Iteration 17 
Iteration 18 
Iteration 19 
Iteration 20 
Iteration 21 
Iteration 22 
Iteration 23 
Iteration 24 
Iteration 25 
Iteration 26 
Iteration 27 
Iteration 28 
Iteration 29 
Iteration 30 
Iteration 31 
Iteration 32 
Iteration 33 
Iteration 34 
Iteration 35 
Iteration 36 
Iteration 37 
Iteration 38 
Iteration 39 
Iteration 40 
Iteration 41 
Iteration 42 
Iteration 43 
Iteration 44 
Iteration 45 
Iteration 46 
Iteration 47 
Iteration 48 
Iteration 49 
Iteration 50 
Iteration 51 
Iteration 52 
Iteration 53 
Iteration 54 
Iteration 55 
Iteration 56 
Iteration 57 
Iteration 58 
Iteration 59 
Iteration 60 
Iteration 61 
Iteration 62 
Iteration 63 
Iteration 64 
Iteration 65 
Iteration 66 
Iteration 67 
Iteration 68 
Iteration 69 
Iteration 70 
Iteration 71 
Iteration 72 
Iteration 73 
Iteration 74 
Iteration 75 
Iteration 76 
Iteration 77 
Iteration 78 
Iteration 79 
Iteration 80 
Iteration 81 
Iteration 82 
Iteration 83 
Iteration 84 
Iteration 85 
Iteration 86 
Iteration 87 
Iteration 88 
Iteration 89 
Iteration 90 
Iteration 91 
Iteration 92 
Iteration 93 
Iteration 94 
Iteration 95 
Iteration 96 
Iteration 97 
Iteration 98 
Iteration 99 
Iteration 100 
Iteration 101 
Iteration 102 
Iteration 103 
impute_arg

Multiple Imputation using Bootstrap and PMM

aregImpute(formula = ~crmrte + urban + central + other + prbarr + 
    prbconv + prbpris + avgsen + polpc + density + taxpc + pctmin80 + 
    wcon + wtuc + wtrd + wfir + wser + wmfg + wfed + wsta + wloc + 
    mix + pctymle, data = dfCrime, n.impute = 100, match = "closest")

n: 90   p: 23   Imputations: 100    nk: 3 

Number of NAs:
  crmrte    urban  central    other   prbarr  prbconv  prbpris   avgsen 
       0        0        1        1        0        0        0        0 
   polpc  density    taxpc pctmin80     wcon     wtuc     wtrd     wfir 
       0        0        0        0        0        0        0        0 
    wser     wmfg     wfed     wsta     wloc      mix  pctymle 
       0        0        0        0        0        0        0 

         type d.f.
crmrte      s    2
urban       l    1
central     l    1
other       l    1
prbarr      s    2
prbconv     s    2
prbpris     s    2
avgsen      s    2
polpc       s    2
density     s    2
taxpc       s    2
pctmin80    s    2
wcon        s    2
wtuc        s    2
wtrd        s    2
wfir        s    2
wser        s    2
wmfg        s    2
wfed        s    2
wsta        s    2
wloc        s    2
mix         s    2
pctymle     s    2

Transformation of Target Variables Forced to be Linear

R-squares for Predicting Non-Missing Values for Each Variable
Using Last Imputations of Predictors
central   other 
  0.937   0.971 
df<-melt(t(impute_arg$imputed$central), value.name=c("central"))
options(repr.plot.width=4, repr.plot.height=2)
ggplot(data = df, aes(x = central)) + geom_histogram(bins=30)

options(repr.plot.width=NULL, repr.plot.height=NULL)
median(impute_arg$imputed$central)
[1] 0.5
df<-melt(t(impute_arg$imputed$other), value.name=c("other"))
options(repr.plot.width=4, repr.plot.height=2)
ggplot(data = df, aes(x = other)) + geom_histogram(bins=30)

options(repr.plot.width=NULL, repr.plot.height=NULL)

Assign our new values

dfCrime$central[which(dfCrime$county==71)]<-median(impute_arg$imputed$central)
dfCrime$other[which(dfCrime$county==71)]<-median(impute_arg$imputed$other)

Recode the categories for region and metro

dfCrime$region <- case_when (
            (dfCrime$central ==0 & dfCrime$west ==0) ~ 0, #Eastern, Coastal, Other
            (dfCrime$central ==0 & dfCrime$west ==1) ~ 1, #Western
            (dfCrime$central ==1 & dfCrime$west ==0) ~ 2  #Central
        )
dfCrime$regcode =
            factor( dfCrime$region , levels = 0:2 , labels =
                    c( 'O',
                       'W',
                       'C' )
                   )
dfCrime$metro =
            factor( dfCrime$urban , levels = 0:1 , labels =
                    c( 'Outside',
                       'Inside'
                      )
                   )
dfCrime %>%
filter(county == 71) %>%
select(county, west, central, urban, region, regcode, metro)
  county west central urban region regcode  metro
1     71    0     0.5     1     NA    <NA> Inside

Let’s review our density numbers again by looking in more detail at its distribution.

#options(repr.plot.width=8, repr.plot.height=4)
ggplot(data = dfCrime, aes(x = density)) + 
      geom_histogram(bins=90)

We note that one of the counties has an extremely low density. Near zero.

dfCrime %>%
filter(density < 0.01)
  county year    crmrte   prbarr  prbconv prbpris avgsen      polpc
1    173   87 0.0139937 0.530435 0.327869    0.15   6.64 0.00316379
      density    taxpc west central urban pctmin80    wcon     wtuc
1 2.03422e-05 37.72702    1       0     0 0.253914 231.696 213.6752
      wtrd    wfir     wser   wmfg   wfed   wsta   wloc       mix
1 175.1604 267.094 204.3792 193.01 334.44 414.68 304.32 0.4197531
     pctymle region regcode other nonurban   metro
1 0.07462687      1       W     0        1 Outside

In review of the North Carolina county density data from 1985, the smallest population density in any county in North Carolina is 0.0952. This makes the density of 0.0000203422 for county 173 statistically impossible. It is miscoded.

http://ncosbm.s3.amazonaws.com/s3fs-public/demog/dens7095.xls

(Note to team: We could use this table if we want to assign names to our counties by comparing the population densities. What is interesting is that the 6 rows of missing values we removed earlier can be found in the tail of this table. There was an arbitrary cut off after a certain density - lkely because the counties were not statistically significant. County 173 is not one of those counties, however, as our imputation process will demonstrate.)

dfCrime$density[which(dfCrime$county==173)]<- NA
dfSubset <- dfCrime %>% filter(urban==0 & west ==1) #we will use the other non-urban western counties to find a replacement

impute_arg <- aregImpute(~ crmrte + # urban + central + west + other + # region variables become constant and no longer apply
                         prbarr + prbconv + prbpris + avgsen + polpc + 
                         density + taxpc + pctmin80 + wcon + wtuc +
                         wtrd + wfir + wser + wmfg + wfed + wsta + wloc +
                         mix + pctymle, data = dfSubset, match="closest", n.impute = 30)
Iteration 1 
Iteration 2 
Iteration 3 
Iteration 4 
Iteration 5 
Iteration 6 
Iteration 7 
Iteration 8 
Iteration 9 
Iteration 10 
Iteration 11 
Iteration 12 
Iteration 13 
Iteration 14 
Iteration 15 
Iteration 16 
Iteration 17 
Iteration 18 
Iteration 19 
Iteration 20 
Iteration 21 
Iteration 22 
Iteration 23 
Iteration 24 
Iteration 25 
Iteration 26 
Iteration 27 
Iteration 28 
Iteration 29 
Iteration 30 
impute_arg

Multiple Imputation using Bootstrap and PMM

aregImpute(formula = ~crmrte + prbarr + prbconv + prbpris + avgsen + 
    polpc + density + taxpc + pctmin80 + wcon + wtuc + wtrd + 
    wfir + wser + wmfg + wfed + wsta + wloc + mix + pctymle, 
    data = dfSubset, n.impute = 30, match = "closest")

n: 20   p: 20   Imputations: 30     nk: 3 

Number of NAs:
  crmrte   prbarr  prbconv  prbpris   avgsen    polpc  density    taxpc 
       0        0        0        0        0        0        1        0 
pctmin80     wcon     wtuc     wtrd     wfir     wser     wmfg     wfed 
       0        0        0        0        0        0        0        0 
    wsta     wloc      mix  pctymle 
       0        0        0        0 

         type d.f.
crmrte      s    2
prbarr      s    2
prbconv     s    2
prbpris     s    2
avgsen      s    2
polpc       s    2
density     s    1
taxpc       s    2
pctmin80    s    2
wcon        s    2
wtuc        s    2
wtrd        s    2
wfir        s    2
wser        s    2
wmfg        s    2
wfed        s    2
wsta        s    2
wloc        s    2
mix         s    2
pctymle     s    2

Transformation of Target Variables Forced to be Linear

R-squares for Predicting Non-Missing Values for Each Variable
Using Last Imputations of Predictors
density 
      1 
df<-melt(t(impute_arg$imputed$density), value.name=c("density"))
options(repr.plot.width=4, repr.plot.height=2)
ggplot(data = df, aes(x = density)) + geom_histogram(bins=30)

options(repr.plot.width=NULL, repr.plot.height=NULL)
dfCrime$density[which(dfCrime$county==173)]<-median(impute_arg$imputed$density)
dfCrime$density[which(dfCrime$county==173)]
[1] 1.505422

Now, we will examine histograms for the remaining variables

dfEconVars <- as.data.frame(cbind(dfCrime$wcon, dfCrime$wtuc, dfCrime$wtrd, dfCrime$wfir, dfCrime$wser, dfCrime$wmfg, dfCrime$wfed, dfCrime$wsta, dfCrime$wloc))
names(dfEconVars) <- c('wcon', 'wtuc', 'wtrd', 'wfir', 'wser', 
                              'wmfg', 'wfed', 'wsta', 'wloc')

ggplot(melt(dfEconVars),aes(x=value)) + geom_histogram(bins=30) + facet_wrap(~variable)
No id variables; using all as measure variables

Each histogram for the wage information looks evenly distributed. We have no further remark at this time. We move to the justice an law enforcement variables. With these variables being mostly < 1 we’ll also take the log for comparison.

q1<-ggplot(data = dfCrime, aes(x = prbarr)) + 
      geom_histogram(bins=30)
q11<-ggplot(data = dfCrime, aes(x = log(prbarr))) + 
      geom_histogram(bins=30) 
            
q2<-ggplot(data = dfCrime, aes(x = prbconv)) + 
      geom_histogram(bins=30)
q21<-ggplot(data = dfCrime, aes(x = log(prbconv))) + 
      geom_histogram(bins=30)

q3<-ggplot(data = dfCrime, aes(x = prbpris)) + 
      geom_histogram(bins=30)
q31<-ggplot(data = dfCrime, aes(x = log(prbpris))) + 
      geom_histogram(bins=30)

q4<-ggplot(data = dfCrime, aes(x = avgsen)) + 
      geom_histogram(bins=30)
q41<-ggplot(data = dfCrime, aes(x = log(avgsen))) + 
      geom_histogram(bins=30)

q5<-ggplot(data = dfCrime, aes(x = polpc)) + 
      geom_histogram(bins=30)
q51<-ggplot(data = dfCrime, aes(x = log(polpc))) + 
      geom_histogram(bins=30)

q6<-ggplot(data = dfCrime, aes(x = mix)) + 
      geom_histogram(bins=30)
q61<-ggplot(data = dfCrime, aes(x = log(mix))) + 
      geom_histogram(bins=30)


grid.arrange(q1, q11, q2, q21, q3, q31, q4, q41, q5, q51, q6, q61, ncol=2)

The log transformation for these variables makes them more evenly distributed. We will transform these variables to their log equivalents and confirm with plots to see whether the result shows more linearity.

#Plot of the criminal justice and law enforcment related variables vs crmrte
q1<-ggplot(data = dfCrime, aes(x = log(prbarr), y = crmrte, color = region)) + 
      geom_point()+
  geom_smooth(method = "lm")
q2<-ggplot(data = dfCrime, aes(x = log(prbconv), y = crmrte, color = region)) + 
      geom_point()+
  geom_smooth(method = "lm")
q3<-ggplot(data = dfCrime, aes(x = log(prbpris), y = crmrte, color = region)) + 
      geom_point()+
  geom_smooth(method = "lm")
q4<-ggplot(data = dfCrime, aes(x = log(avgsen), y = crmrte, color = region)) + 
      geom_point()+
  geom_smooth(method = "lm")
q5<-ggplot(data = dfCrime, aes(x = log(polpc), y = crmrte, color = region)) + 
      geom_point()+
  geom_smooth(method = "lm")
q6<-ggplot(data = dfCrime, aes(x = log(mix), y = crmrte, color = region)) + 
      geom_point()+
  geom_smooth(method = "lm")

grid.arrange(q1, q2, q3, q4, q5, q6, ncol=2)

Of the six variables, only prbarr, prbconv and polpc show univariate correlation with crime. We believe these will be better candidates for our model selection. Further, we see mix has no correlation with crmrate and may be its own outcome variable.

dfCrime$logprbarr <- log(dfCrime$prbarr)
dfCrime$logprbconv <- log(dfCrime$prbconv)
dfCrime$logprbpris <- log(dfCrime$prbpris)
dfCrime$logavgsen <- log(dfCrime$avgsen)
dfCrime$logpolpc <- log(dfCrime$polpc)
dfCrime$logmix <- log(dfCrime$mix)

Next we take a look at the demographic histograms and their log alternatives

q1<-ggplot(data = dfCrime, aes(x = pctymle)) + 
      geom_histogram(bins=30)
q11<-ggplot(data = dfCrime, aes(x = log(pctymle))) + 
      geom_histogram(bins=30) 
            
q2<-ggplot(data = dfCrime, aes(x = pctmin80)) + 
      geom_histogram(bins=30)
q21<-ggplot(data = dfCrime, aes(x = log(pctmin80))) + 
      geom_histogram(bins=30)

q3<-ggplot(data = dfCrime, aes(x = density)) + 
      geom_histogram(bins=30)
q31<-ggplot(data = dfCrime, aes(x = log(density))) + 
      geom_histogram(bins=30)


grid.arrange(q1, q11, q2, q21, q3, q31, ncol=2)

The shape after transformation make the data more distributed. We will include transfomations of these variables as well.

dfCrime$logdensity <- log(dfCrime$density)
dfCrime$logtaxpc <- log(dfCrime$taxpc)
dfCrime$logpctmin80 <- log(dfCrime$pctmin80)
dfCrime$logpctymle <- log(dfCrime$pctymle)

Finally, we’ll take a look at taxpc and the crmrte variable itself.

q1<-ggplot(data = dfCrime, aes(x = taxpc)) + 
      geom_histogram(bins=30)
q11<-ggplot(data = dfCrime, aes(x = log(taxpc))) + 
      geom_histogram(bins=30) 
            
q2<-ggplot(data = dfCrime, aes(x = crmrte)) + 
      geom_histogram(bins=30)
q21<-ggplot(data = dfCrime, aes(x = log(crmrte))) + 
      geom_histogram(bins=30)

grid.arrange(q1, q11, q2, q21, ncol=2)

The crmrte and taxpc variables are more evenly distributed after transformation. We’ll add those to our dataframe.

dfCrime$logcrmrte = log(dfCrime$crmrte)
dfCrime4logtaxpc = log(dfCrime$taxpc)

As a final point of discussion we will identify additional variables we wish to operationalize for use in our models. The include a variable that expresses the economic condition of the county and a variable that expresses criminal justice effectiveness.

The first variable on the economic condition will include the sum of all average weekly wages from the 1980 census information. Since we do not know how many were employed at that wage we use this summary the best available proxy.

dfCrime$allWages<-dfCrime$wcon + dfCrime$wtuc + dfCrime$wtrd + dfCrime$wfir +
    dfCrime$wser + dfCrime$wmfg + dfCrime$wfed + dfCrime$wsta + dfCrime$wloc

As a second variable, we are interested in understanding the effectiveness of the criminal justice system as a crime deterrent. Our proxy will be the number of convictions per incident.

This is operationalized by taking the probability of arrests, pbrarr (which is defined as arrests per incident) and multiplying by the probability of convictions, pbrconv (which is defined as convictions per arrest). The new variable is defined below.

dfCrime$crimJustEff<-dfCrime$prbarr * dfCrime$prbconv

We will also create a logarithmic transformation of this variable based on our histogram analysis from before.

dfCrime$logcrimJustEff<-log10(dfCrime$crimJustEff)

Summary and Results

Our outcome variable is the crime rate (“crmrte”), which is defined as the crimes committed per person in a specific county during 1987. The crime rate of the 90 counties in our sample dataset range between 0.0055 - 0.0990, with a mean of 0.0335.

From the boxplot below, most of the counties have a crime rate between 0.0055 and 0.0700, with 5 outliers having a crime rate > 0.0700.

options(repr.plot.width=4, repr.plot.height=4)
ggplot(data = dfCrime, aes(y = crmrte)) + 
      geom_boxplot()

While mix (the type of crime committed) is also potentially an outcome variable, our research focuses on providing policy recommendations to reduce crime in general and not a specific type of crime. Mix is also not a linear outcome and hence difficult to measure.

We propose 3 multiple linear regression models

  • First Model: Has only the explanatory variables of key interest and no other covariates.

  • Second Model: Includes the explanatory variables and covariates that increase the accuracy of our results without substantial bias.

  • Third Model: An expansion of the second model with most covariates, designed to demonstrate the robustness of our results to model specification.

As we proceed with each model, we verify the CLM assumptions of OLS are addressed below: * MLR1 Linear in parameters: The models have had its data transformed as described above to allow a linear fit of the model. * MLR2 Random Sampling: The data is collected from a data set with rolled up data for each county. It is not randomly sampled by area or population. * MLR3 No perfect multicollinearity: None of the variables chosen for the model are constant or perfectly collinear as demonstrated by the scatterplot below. * MLR4’ The expectation of u and and covariance of each regressor with u are ~0. This shows that our model’s regressors are exogenous with the error. * MLR5’ Spherical errors: There is homoscedasticity and no autocorrelation [TBD]. * MLR6’ Our error terms should be normally distributed [TBD].

By satisfying these assumptions, we can expect our coefficients will be approaching the true parameter values in probability.

Evidence of multi-collinearity (or perfect collinearity)?

options(repr.plot.width=8, repr.plot.height=8)
pairs(~ logcrimJustEff + logpolpc + allWages + logtaxpc + logdensity + logpctmin80 + logpctymle,data=dfCrime, main="Scatterplot Matrix of Model Variables")

Model 1

Introduction

Our base hypothesis is that crime can be fundamentally explained by two factors: the effectiveness of the criminal justice system and the economic conditions.

Criminal Justice Effectiveness is self defined : To be able to track crimes, they must be reported to police, who can then make arrests and the legal system provides judgement (convictions/sentencing) Criminal justice also has a relationship to crime as a deterrent, as the probability of getting caught, convicted, sentenced could potentially deter crime.

We operationalize criminal justice effectiveness as follows: probability of Convictions * Crimes committed. We define as: prbconv * prbarr = conv/arrest * arrest/crime = convictions/crime. Without more granular data, this provides a single parsimonious metric that helps understand how the law enforcement and criminal justice system works.

Model 1 EDA

Data Transformations

options(repr.plot.width=4, repr.plot.height=4)
hist((dfCrime$prbconv))

hist((dfCrime$prbarr))

The distribution of both probability of conviction and probability of arrest are peculiar and non-normal. It could be argued that both of these variables should be bound between 0 and 1. However, “probability” of conviction is proxied by a ratio of convictions to arrests. It is in fact common that defendents are charged with multiple crimes and convicted, but were only arrested once.

For “probability” of arrest, it could be possible there are multiple arrests for a single offense. However, the single data point that is greater than one, is >3 standard deviations away from the distribution. This outlier will have high leverage on our model and will be preemptively removed as the data supplied is likely in error and is not representative of the bulk of North Carolina counties.

For parsimony, we can simply the probability of arrest and probability of conviction by multiplying to effectively get the ratio of convictions to offenses. The normality of this factor can be improved by taking a log transform. QQ plots help to visualize how normality improves for the inner quartiles.

(dfCrime[51,]$prbarr - mean(dfCrime$prbarr))/sd(dfCrime$prbarr) # how many standard deviations away the outlier lies
[1] 5.779438
#hist(log(dfCrime$crimJustEff))
ggplot(data=dfCrime, aes(sample= crimJustEff)) + stat_qq() + stat_qq_line() + ggtitle("QQ Plot of Crim Just Eff")

dfCrime <- dfCrime[dfCrime$crimJustEff < 1,] # removing high flying outlier
ggplot(data=dfCrime, aes(sample= crimJustEff)) + stat_qq() + stat_qq_line() + ggtitle("QQ Plot of Crim Just Eff")

ggplot(data=dfCrime, aes(sample= logcrimJustEff)) + stat_qq() + stat_qq_line() + 
ggtitle("QQ Plot of log transformed Crim Just Eff")

## Can show histogram/qqplot side by side in RMD. 

We theorize that the second major cause of crime are bad economic conditions. When there are worse economic conditions, crime can be more attractive due to:

  • Lack of means: People forced into crimes because they need to make ends meet
  • Lack of occupation: People commit crimes because they are not busy at work
  • Lack of opportunity: High discount rate for future due to no long-term opportunity, incentive to take the risk and commit crimes hoping for big payoff.

We operationalize economic conditions by looking at wages. For this model, we define this as the sum of all average wages in each county. We think this is best proxy from our data because it answers all of the above (higher wages leads to better means and better opportunities). From our EDA we also confirm that in general these sums are not skewed by having 1 really high paying sector in each county as we see a strong relationship between avg quartile across all job types and total sum. This can be seen in the chart below.

# # Quantiles for all jobs
dfWage<-mutate(dfCrime,qCon=ntile(dfCrime$wcon,4))
dfWage<-mutate(dfWage,qTuc=ntile(dfCrime$wtuc,4))
dfWage<-mutate(dfWage,qTrd=ntile(dfCrime$wtrd,4))
dfWage<-mutate(dfWage,qFir=ntile(dfCrime$wfir,4))
dfWage<-mutate(dfWage,qSer=ntile(dfCrime$wser,4))
dfWage<-mutate(dfWage,qMfg=ntile(dfCrime$wmfg,4))
dfWage<-mutate(dfWage,qFed=ntile(dfCrime$wfed,4))
dfWage<-mutate(dfWage,qSta=ntile(dfCrime$wsta,4))
dfWage<-mutate(dfWage,qLoc=ntile(dfCrime$wloc,4))
## Average quantile
dfWage$qAvg= (dfWage$qCon+dfWage$qTuc+dfWage$qTrd+dfWage$qFir+dfWage$qSer+dfWage$qMfg+dfWage$qFed+dfWage$qSta+dfWage$qLoc)/9
plot(dfCrime$allWages,dfWage$qAvg)

hist(dfCrime$allWages)

ggplot(data=dfCrime, aes(sample= allWages)) + stat_qq() + stat_qq_line() + ggtitle("QQ Plot of sum of wages")

Model 1 Linear Model

mod1 <- lm(dfCrime$logcrmrte ~ dfCrime$allWages + dfCrime$logcrimJustEff)
(mod1)

Call:
lm(formula = dfCrime$logcrmrte ~ dfCrime$allWages + dfCrime$logcrimJustEff)

Coefficients:
           (Intercept)        dfCrime$allWages  dfCrime$logcrimJustEff  
            -6.3008940               0.0006381              -1.0014763  
summary(mod1)$adj.r.square
[1] 0.4559566
## will be details on effect size and standard error as we cover this in class.
plot(mod1, which=5)

plot(mod1, which=2)

plot(mod1, which=3)

plot(mod1, which=1)

The model shows a moderate good fit, with an adjusted R square of 0.46. This can be interpreted as, the model explains 46% of the variation in crime. Next the model is plotted in a Residuals vs Leverage plot. This plot shows that all the points have a cook’s distance of less than 0.5. There are no points that have enough leverage and residual than when deleted greatly alter the model coefficients.

The root of standardized residuals all fall within about 1.6. This is very good, as we can expect 95% of the points to fall within 3 standardized residuals of each other. (\(\sqrt(3) \approx 1.73\))

Finally, the residuals vs fitted plot shows a well centered and mostly nromal distribution about 0. There are no major trends or variation changes across the fitted values. This suggests that major uncorrelated variables have not been left out of the model. We will discuss the possible ommited variable biases further, in the next sections.

Model 1 CLM Assumptions: [To be finalized] * MLR1 Linear in paramters: The model has had its data transformed as described above to allow a linear fit of the model. * MLR2 Random Sampling: The data is collected from a data set with rolled up data for each county. It is not randomly sampled by area or population. * MLR3 No perfect multicollinearity: None of the variables chosen for the model are constant or perfectly collinear as the economy and criminal justice effectiveness are independent. * MLR4’ The expectation of u and and covariance of each regressor with u are ~0. This shows that our model’s regressors are exogenous with the error.

By satisfying these assumptions, we can expect that our coefficients are approaching the true parameter values in probability.

##MLR 5,6 to be discussed in week 13…?

cov(resid(mod1), dfCrime$allWages)
[1] 2.778984e-14
cov(resid(mod1), log(dfCrime$crimJustEff))
[1] -2.644899e-18
mean(resid(mod1))
[1] -5.575418e-18

Model 2

Introduction

In this model, we introduce the additional covariates of population per square mile (density), tax per capita (taxpc) and police per capita (polpc) to increase the accuracy of our regression. We are including these additional variables to our second model, as they add accuracy to the explanatory variables used in our first model:

  1. The DENSITY of an area can have significant impacts on:
    • Criminal Justice Effectiveness: with more people in a given area, crime frequency increases (+ bias direction). However, more people means there are more potential witnesses, making it easier to catch criminals (- bias direction).
    • Economic Opportunity (ie. AllWages): in high density areas, there is an increase in demand for support services such as food, retail, utilities, etc. As a result, there is a high demand for service jobs, which increases the economic opportunities within the area (+ bias direction). However, more people in a given area, there is a closer proximity to drugs, alcohol and gang violence - all of which are inhimitors to better economic outcomes.
  2. The Police Per Capita in a county can be influential on the Criminal Justice Effectiveness. With more police in a given area, one would think that crime rates would decrease, however our correlation plot below tells a different story. Including this variable in our analysis will give us more insight into the variables used in model 1.
  3. The Tax Per Capita can have a direct impact on the Police Per Capita. A higher tax per capita, means that the county has more tax dollars to spend on protection services (ie. increasing the number of police in the county).

\[log(crmrate) = \beta_0 + \beta_1crimjusteff + \beta_2log(polpc) + \beta_3density + \beta_4allWages + \beta_5taxpc + u\]

Model 2 EDA and Data Transformations

corrplot(cor(dfCrime[,c("logcrmrte", "logcrimJustEff", "logprbarr", "logprbconv", "prbpris", "polpc", "taxpc", "allWages", "urban", "density", "pctymle", "pctmin80")]),method='circle', type = 'lower')

par(mfrow = c(2,2))
hist(dfCrime$polpc, breaks=25)
hist(dfCrime$taxpc, breaks=25)
hist(dfCrime$density, breaks=25)

par(mfrow = c(3,2))
hist(dfCrime$polpc, main="Hist of polpc")
hist(dfCrime$logpolpc, main="Hist of Log10 logpolpc")
hist(dfCrime$taxpc, main="Hist of taxpc")
hist(dfCrime$logtaxpc, main="Hist of Log10 logtaxpc")
hist(dfCrime$density, main="Hist of density")
hist(dfCrime$logdensity, main="Hist of Log10 logdensity")

# par(mfrow = c(2,2))
# plot(dfCrime$logcrimJustEff, dfCrime$polpc, main = 'polpc vs logcrimJustEff', xlab='logcrimJustEff', ylab='polpc')
# plot(dfCrime$logcrimJustEff, dfCrime$logpolpc, main = 'logpolpc vs logcrimJustEff', xlab='logcrimJustEff', ylab='logpolpc')
# plot(dfCrime$logcrimJustEff, dfCrime$taxpc, main = 'taxpc vs logcrimJustEff', xlab='logcrimJustEff', ylab='taxpc')
# plot(dfCrime$logcrimJustEff, dfCrime$logtaxpc, main = 'logtaxpc vs logcrimJustEff', xlab='logcrimJustEff', ylab='logtaxpc')

In the histograms above, we see that the both polpc and taxpc exhibit right skew. Taking the \(log_{10}\) of polpc brings the distribution closer to normal. However, the \(log\) of taxpc and density makes the distributions even more skewed.

As a result, we will use the \(log\) of polpc (logpolpc) in our second model and will not transform the taxpc and density variables.

Model 2 Linear Model

model2 <- lm(logcrmrte ~ logcrimJustEff + logpolpc + allWages + taxpc + density, data = dfCrime)
model2

Call:
lm(formula = logcrmrte ~ logcrimJustEff + logpolpc + allWages + 
    taxpc + density, data = dfCrime)

Coefficients:
   (Intercept)  logcrimJustEff        logpolpc        allWages  
    -2.2783403      -0.6345041       0.3984752       0.0002599  
         taxpc         density  
    -0.0043953       0.1118637  
summary(model2)

Call:
lm(formula = logcrmrte ~ logcrimJustEff + logpolpc + allWages + 
    taxpc + density, data = dfCrime)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.11981 -0.17313 -0.02517  0.27332  0.65535 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -2.2783403  1.1634826  -1.958 0.053565 .  
logcrimJustEff -0.6345041  0.1810816  -3.504 0.000741 ***
logpolpc        0.3984752  0.1396949   2.852 0.005474 ** 
allWages        0.0002599  0.0001548   1.679 0.096908 .  
taxpc          -0.0043953  0.0045553  -0.965 0.337412    
density         0.1118637  0.0379080   2.951 0.004117 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3541 on 83 degrees of freedom
Multiple R-squared:  0.5669,    Adjusted R-squared:  0.5408 
F-statistic: 21.73 on 5 and 83 DF,  p-value: 7.614e-14

The Adjusted R-squared variable penalizes for additional variables, which means there is a chance that this value will decrease if the added variables do not contribute to the model. By comparing the Adjusted R-squared value between our first and second models, we see that log(polpc), taxpc and density help describe log(crmrate). Our second model has an Adjusted R-squared value of 0.5004, which means 50.04% of the variation in the \(log_{10}\) of crime rate is explained by the explanatory variables used in this model. This is a significant increase compared to our first model, that has an Adjusted R-squared value of 0.4520.

In addition, the F-statistic is 16.62 with a statistically significant p-value of < 6.263e-11. As a result, we reject the null hypothesis that none of the independent variables help to describe log(crmrate).

Coefficient Analysis (assuming ceterus paribus): - logcrimJustEff: -0.1607. This suggests that for a 1% increase in criminal justice efficiency, there is a 0.1607% decrease in crime rate. - logpolpc: 0.3701. This suggests that for a 1% increase in police per capita, there is a 0.3701% increase in crime rate. - allWages: 0.00006692. This suggests that for a 1% increase in total average weekly wage, there is a 0.0067% increase in crime rate. - taxpc: -0.001632. This suggests that for a 1% increase in tax per capita, there is a 0.1632% decrease in crime rate. - density: 0.06259. This suggests that for a 1% increase in density, there is a 6.259% increase in crime rate.

Results - WIP

  • Standard Errors explanation will go here. Placeholder cell for now.

Model 2 CLM Assumptions: [To be Finalized] * MLR1 Linear in paramters: The model has had its data transformed as described above to allow a linear fit of the model. * MLR2 Random Sampling: The data is collected from a data set with rolled up data for each county. It is not randomly sampled by area or population. * MLR3 No perfect multicollinearity: None of the variables chosen for the model are constant or perfectly collinear as the economy and criminal justice effectiveness are independent. * MLR4’ The expectation of u and and covariance of each regressor with u are ~0. This shows that our model’s regressors are exogenous with the error.

By satisfying these assumptions, we can expect that our coefficients are approaching the true parameter values in probability.

##MLR 5,6 to be discussed in week 13…?

Conclusion : Are the conclusions they draw based on this evaluation appropriate? Did the team interpret the results in terms of their research question?

Compared to model 1, the adjusted \(R^2\) of model 2 is only marginally higher. This suggests that we should continue our analysis by focusing on the join significance of the variables added in model 2.

Model 3

Introduction

Despite the improvements in the accuracy of model 2 over model 1, we are still only explaining about 55% of the variation in our data. As a result, we propose to also analyse the topic of demographics which could have an effec on both of our key explanatory variables.

One key component of demographics is the race of the county inhabitants and how they are perceived and treated by others, especially for minorities in the population. For example, systemic racism could have an important effect on: * Criminal Justice Effectiveness: If police, lawyers and judges are racially biased, this could lead to more arrests and more convictions regardless of the strength of the legal case and the evidence. As a result, we hypothesize the crime rate would increase. * Economic Opportunity: Racism could prohibit members of the minority from having access to education, jobs and higher wages. Racism could also limit access to healthcare and social programmes which has a negative effect on economic opportunity.

However, since we cannot directly measure racism, we have to operationalize this covariate by examining its effect in the real world. We propose to use the variable pctmin80, which represents the percentage of minorities in the population of the county. This is a good indicator that is also a linear parameter: given a higher the percentage of minorities, we should expect to see a greater effect.

We propose to operationalize gender and age with the variable

We have also chosen not to include other variables from our dataset in our model: * Region: While geographical indicators are also important, particularly as they may represent clusters of jobs and skilled workers, it is not a linear parameter (i.e. we can not simply increase a region by “1” and expect to see an effect on the crime rate.“) * Urban: We believe the variable”density" better explains the same effects as “urban”, while also being a linear parameter. In addition, there may be data points that failed to meet the cutoff for being defined as urban, but may still see the same effects as being urban and hence may distort our analysis. * Age and Gender: While age and gender are important demographic variables, the only variable in our dataset is pctymle which provides the percentage of young males in the population. However, given that this variable encompasses both male and young, we may not be able to discern if age or gender has the larger effect (if any at all).

Model 3 EDA and Data Transformations

Percentage Minority: From the summary and boxplot below, we can see that the percentage of minorities ranges from 0.0154 - 0.6435, with a mean of 0.2621. We note that there are no major outliers.

In addition from the scatterplots below, we see that using applying log10 on pctmin80 exposes a more linear relationship with the points more balanced on either side of the trendline. As a result, we will use the log-transformed version of pctmin80.

summary(dfCrime$pctmin80)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.01541 0.10084 0.25391 0.25987 0.38223 0.64348 
boxplot(dfCrime$pctmin80)

plot(dfCrime$pctmin80, dfCrime$logcrmrte)
abline(lm(dfCrime$logcrmrte~dfCrime$pctmin80))

plot(dfCrime$logpctmin80, dfCrime$logcrmrte)
abline(lm(dfCrime$logcrmrte~dfCrime$logpctmin80))

Model 3 Linear Model

model3<-lm(logcrmrte ~ logcrimJustEff + logpolpc + allWages + taxpc + density + logpctmin80 , data = dfCrime)
summary(model3)

Call:
lm(formula = logcrmrte ~ logcrimJustEff + logpolpc + allWages + 
    taxpc + density + logpctmin80, data = dfCrime)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.79245 -0.12650  0.01757  0.14956  0.68936 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -2.0447465  0.9226341  -2.216 0.029448 *  
logcrimJustEff -0.8668274  0.1472063  -5.889 8.24e-08 ***
logpolpc        0.4066755  0.1107124   3.673 0.000426 ***
allWages        0.0003211  0.0001230   2.611 0.010732 *  
taxpc          -0.0081348  0.0036484  -2.230 0.028503 *  
density         0.0899602  0.0302004   2.979 0.003806 ** 
logpctmin80     0.2399710  0.0338834   7.082 4.44e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2806 on 82 degrees of freedom
Multiple R-squared:  0.7313,    Adjusted R-squared:  0.7116 
F-statistic:  37.2 on 6 and 82 DF,  p-value: < 2.2e-16

The model shows a good fit, with an adjusted R-squared of 0.7322, meaning that the model explains 73% of the variation in crime.

For all of our 6 different independent variables, we note each of them have statistical significance at the 95% level or better. Of these 6, criminal justice efficiency, minority percentages and density are the most significant.

Interpretation of coefficients (Assuming ceterus paribus):

Positive coefficients: * Police presence: If we increase police per capita by 1 unit, we expect the crime rate to increase by 33%. * AllWages: If we increase wages by 1 dollar, we expect the crime rate to increase by 0.01% * Density: If we increase density by 1 unit, we expect the crime rate to increase by 5% * Percentage of minorities: If the percentage of minorities increase by 1%, we expect the crime rate to increase by 0.24%

Negative coefficients: * Criminal justice efficiency: If we increase the criminal justice efficiency by 1%, we expect the crime rate to decrease by 0.34%. * Tax per capita: If we increase tax per capita by 1 unit, we expect the crime rate to decrease by 0.35%

In addition, the F-statistic is 40.04 with a statistically significant p-value of < 2.2e-11. As a result, we reject the null hypothesis that none of the independent variables help to describe log(crmrate).

[Comment on standard error]

In the Residuals vs Leverage plot below, all the points have a cook’s distance of less than 0.5. While there is a point with 0.6 leverage, there are no points that have residual that greatly alter the model coefficients.

The root of standardized residuals all fall within about 1.6. This is very good, as we can expect 95% of the points to fall within 3 standardized residuals of each other. ( (⎯⎯√3)≈1.73 )

Finally, the residuals vs fitted plot shows a well centered and mostly normal distribution about 0. There are no major trends or variation changes across the fitted values. This suggests that major uncorrelated variables have not been left out of the model.

plot(model3, which = 5)

plot(mod1, which=3)

plot(model3, which=1)

Model 3 CLM Assumptions: [To be finalized] * MLR1 Linear in paramters: The model has had its data transformed as described above to allow a linear fit of the model. * MLR2 Random Sampling: The data is collected from a data set with rolled up data for each county. It is not randomly sampled by area or population. * MLR3 No perfect multicollinearity: None of the variables chosen for the model are constant or perfectly collinear with each other. Our new variables for percentage minority and percentage young males may have some relationships with the other variables but they are not perfectly colinear as noted from the scatterplot matrix in our EDA. * MLR4’ The expectation of u and and covariance of each regressor with u are ~0. This shows that our model’s regressors are exogenous with the error.

By satisfying these assumptions, we can expect that our coefficients are approaching the true parameter values in probability.

##MLR 5,6 to be discussed in week 13…?

Results:

Comparison of Regression Models

*Can anyone figure out why logcrimJustEff is on 2 lines?

stargazer(mod1,model2,model3,type="text")

========================================================================================
                                            Dependent variable:                         
                    --------------------------------------------------------------------
                          logcrmrte                          logcrmrte                  
                             (1)                    (2)                    (3)          
----------------------------------------------------------------------------------------
allWages                   0.001***                                                     
                           (0.0001)                                                     
                                                                                        
logcrimJustEff            -1.001***                                                     
                           (0.174)                                                      
                                                                                        
logcrimJustEff                                   -0.635***              -0.867***       
                                                  (0.181)                (0.147)        
                                                                                        
logpolpc                                          0.398***               0.407***       
                                                  (0.140)                (0.111)        
                                                                                        
allWages                                          0.0003*                0.0003**       
                                                  (0.0002)               (0.0001)       
                                                                                        
taxpc                                              -0.004                -0.008**       
                                                  (0.005)                (0.004)        
                                                                                        
density                                           0.112***               0.090***       
                                                  (0.038)                (0.030)        
                                                                                        
logpctmin80                                                              0.240***       
                                                                         (0.034)        
                                                                                        
Constant                  -6.301***               -2.278*                -2.045**       
                           (0.374)                (1.163)                (0.923)        
                                                                                        
----------------------------------------------------------------------------------------
Observations                  89                     89                     89          
R2                          0.468                  0.567                  0.731         
Adjusted R2                 0.456                  0.541                  0.712         
Residual Std. Error    0.385 (df = 86)        0.354 (df = 83)        0.281 (df = 82)    
F Statistic         37.876*** (df = 2; 86) 21.732*** (df = 5; 83) 37.195*** (df = 6; 82)
========================================================================================
Note:                                                        *p<0.1; **p<0.05; ***p<0.01

Comparing the 3 models, we see that our adjusted R2 value has steadily increased from 0.456-0.732 as we introduce more covariates which indicates that we were able to explain more variation in our model not purely by increasing the number of indepedent variables.

At the same time, our standard errors have decreased insert more commentary on standard errors.

We see that by expanding our definitions of criminal justice efficiency and economic opportunity between model 1 and model 3 lowered the coefficients for logcrimJustEff and allWages. This is most likely because that we were able to better explain the effects with our newer variables.

Comment on practical significance after week 12

Policy Recommendations

Given that across all 3 models, we show that both criminal justice efficiency and tax revenues per capita have negative correlations to crime rate, we propose the policy recommendations below to address these issues. In addition, since minority percentages and density were found to be highly significant in the model 3, we believe our recommendations will be of particularly help to those running for political office in counties with a high percentage of minorities or dense urban populations.

  1. Since increasing both criminal justice and tax revenues are negatively correlated, we propose providing more funding for the local justice system.

  2. While increasing taxes on constituents may be difficult politically and may cost candidates the ballot, candidates can instead try to attract investment to bring more jobs with higher wages so you can increase revenues.

  3. Candidates can also propose to levy taxes on things that could lead to crimes or violence such as alcohol and weapons.

  4. Given the significance and relatively large coefficient size of percentage minority, candidates should enroll local law enforcement into bias training.

Ommitted Variables

Expected correlation between omitted and included variables
Omitted Variable Crime Rate (\(B_k\)) Criminal Justice Effectiveness Economic Conditions
Education - unknown +
Social Services - unknown unknown
Unemployment + unknown -
Gang Activity + - -

The 4 major identified ommited variables are shown above. * Education is an important variable because of demographic insights it provides. First, adults with higher education are less likely to participate in Crime and are more likely to have better economic opportunity. Second, a strong school system is also likely correlated with less youth crime. Because of these expected correlations we are likely overestimating the economic conditions coefficient estimate. * Available Social Services could also lower crime. Citizens with strong social services support have more options to get help when they lack means for purchasing basic life needs. However this is more difficult to predict, as some social service projects, like homeless shelters, could lead to more criminal activity. * Unemployment is used as an important indicator of economic health and opportunity. This is would be highly correlated to economic conditions variables like sum of wages. This indicator variable if added to the model would decrease the magnitude of the sum of wage means coefficient estimate.
* Gang or Organized Crime is special case of crime that contains unique causes. It is expected that it would be negatively correlated with criminal justice effectiveness as large social pressures prevent witnesses from supporting prosecution. Gang crime is also negatively correlated with economic conditions. From these assumed correlations, we can say that criminal justice effectiveness and economic conditions are both underestimated compared to including gang activity operationalized variable in the model.

Conclusion

We have shown in this report 3 different models that seek to explain and model changes in the crime rate in North Carolina in 1980. We start with the fundamental premise that crime is caused by both criminal justice efficiency and economic conditions, and further develop our definition of these two key explanatory variables which each new model.

In Model 3, we were able to explain up to 73% of the variation in our data, and found statistical significance at the 95% level or better for each of our covariates. Of these, we believe that increasing the efficiency of the criminal justice system and tax revenues were the most important, particularly for counties with high density and minority populations. However, our findings should be noted with caution as we were unable to study the effect of several ommitted variables including education, availability of social services, unemployment rates and the presence of organized crime. Had we been able to collect data on these variables and apply them in our model, we believe we could increase accuracy without bias.

#for the metro areas:
#increases in wsta and wtuc correlate with less crime
#increases in wfed and wser correlate with the same crime (ceterus paribus)
#increases in wcon, wtrd, wfir, wmfg and wloc correlate with more crime

#for the non-metro areas
#no increase in wage correlates with crime rates going down. They all go up.


# fit parallel slopes
mod<-lm(logcrmrte~wsta*metro, data=dfCrime)
#mod

# Augment the model
augmented_mod <- augment(mod)
#glimpse(augmented_mod)

# scatterplot, with color
data_space <- ggplot(augmented_mod, aes(x = wsta, y = logcrmrte, color = metro)) + 
  geom_point()

# single call to geom_line()
data_space + 
  geom_line(aes(y = .fitted))

# interaction plot
ggplot(dfCrime, aes(y = crmrte, x = wsta, color = metro)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

# fit parallel slopes
mod<-lm(crmrte~wsta*metro, data=dfCrime)
#mod

# Augment the model
augmented_mod <- augment(mod)
#glimpse(augmented_mod)

# scatterplot, with color
data_space <- ggplot(augmented_mod, aes(x = wsta, y = crmrte, color = metro)) + 
  geom_point()

# single call to geom_line()
data_space + 
  geom_line(aes(y = .fitted))

slr <- ggplot(dfCrime, aes(y = crmrte, x = wsta)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = 0)

# model with one slope
lm(crmrte ~ wsta, data = dfCrime)

Call:
lm(formula = crmrte ~ wsta, data = dfCrime)

Coefficients:
(Intercept)         wsta  
  3.617e-03    8.439e-05  
# plot with two slopes
slr + aes(color = metro)

newmod<-lm(crmrte ~ wsta + metro + other  + prbarr + wsta*metro  + prbarr*other, data = dfCrime)
summary(newmod)

Call:
lm(formula = crmrte ~ wsta + metro + other + prbarr + wsta * 
    metro + prbarr * other, data = dfCrime)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.026251 -0.008248 -0.000030  0.006076  0.043253 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)  
(Intercept)       3.203e-02  1.579e-02   2.028   0.0458 *
wsta              1.313e-05  3.755e-05   0.350   0.7274  
metroInside       1.068e-01  5.064e-02   2.109   0.0380 *
other             1.379e-02  8.507e-03   1.621   0.1089  
prbarr           -3.558e-02  1.892e-02  -1.881   0.0636 .
wsta:metroInside -1.780e-04  1.284e-04  -1.386   0.1695  
other:prbarr     -1.570e-02  2.761e-02  -0.569   0.5710  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01344 on 82 degrees of freedom
Multiple R-squared:  0.5212,    Adjusted R-squared:  0.4862 
F-statistic: 14.88 on 6 and 82 DF,  p-value: 1.968e-11
coeftest(newmod, vcov=vcovHC)

t test of coefficients:

                    Estimate  Std. Error t value Pr(>|t|)  
(Intercept)       3.2027e-02  2.0230e-02  1.5831  0.11725  
wsta              1.3132e-05  5.2066e-05  0.2522  0.80150  
metroInside       1.0682e-01  9.2894e-02  1.1500  0.25350  
other             1.3787e-02  8.8453e-03  1.5586  0.12294  
prbarr           -3.5584e-02  1.9880e-02 -1.7899  0.07716 .
wsta:metroInside -1.7803e-04  2.2446e-04 -0.7931  0.42999  
other:prbarr     -1.5704e-02  2.3876e-02 -0.6578  0.51253  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
newmod<-lm(crmrte ~ regcode:log(allWages) + metro:(logcrimJustEff) , data = dfCrime)
summary(newmod)

Call:
lm(formula = crmrte ~ regcode:log(allWages) + metro:(logcrimJustEff), 
    data = dfCrime)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.031821 -0.005497 -0.000697  0.004864  0.035446 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 -0.384045   0.103166  -3.723 0.000360 ***
regcodeO:log(allWages)       0.049436   0.013005   3.801 0.000276 ***
regcodeW:log(allWages)       0.047364   0.012986   3.647 0.000464 ***
regcodeC:log(allWages)       0.048438   0.012890   3.758 0.000320 ***
metroOutside:logcrimJustEff -0.031162   0.005163  -6.036 4.39e-08 ***
metroInside:logcrimJustEff  -0.049307   0.005306  -9.293 1.88e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01063 on 82 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.6967,    Adjusted R-squared:  0.6782 
F-statistic: 37.66 on 5 and 82 DF,  p-value: < 2.2e-16
newmod %>% augment
# A tibble: 88 x 13
   .rownames crmrte regcode log.allWages. metro logcrimJustEff .fitted
   <chr>      <dbl> <fct>           <dbl> <fct>          <dbl>   <dbl>
 1 1         0.0356 C                8.02 Outs…         -0.803  0.0297
 2 2         0.0153 C                7.88 Outs…         -0.709  0.0199
 3 3         0.0130 W                7.85 Outs…         -0.924  0.0163
 4 4         0.0268 C                7.95 Outs…         -0.717  0.0232
 5 5         0.0106 W                7.92 Outs…         -0.607  0.0101
 6 6         0.0146 W                7.86 Outs…         -1.45   0.0332
 7 7         0.0296 O                7.93 Outs…         -0.721  0.0302
 8 8         0.0203 O                7.90 Outs…         -0.521  0.0225
 9 9         0.0304 O                7.84 Outs…         -0.959  0.0336
10 10        0.0222 O                8.05 Outs…         -0.700  0.0360
# … with 78 more rows, and 6 more variables: .se.fit <dbl>, .resid <dbl>,
#   .hat <dbl>, .sigma <dbl>, .cooksd <dbl>, .std.resid <dbl>
newmod<-lm(crmrte ~ regcode:log(allWages) + metro + logcrimJustEff + metro*logcrimJustEff, data = dfCrime)
summary(newmod)

Call:
lm(formula = crmrte ~ regcode:log(allWages) + metro + logcrimJustEff + 
    metro * logcrimJustEff, data = dfCrime)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.031480 -0.005701 -0.000744  0.004381  0.035698 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                -0.395122   0.104840  -3.769 0.000310 ***
metroInside                -0.028355   0.042496  -0.667 0.506508    
logcrimJustEff             -0.030767   0.005214  -5.901 8.05e-08 ***
regcodeO:log(allWages)      0.050876   0.013226   3.847 0.000238 ***
regcodeW:log(allWages)      0.048818   0.013212   3.695 0.000398 ***
regcodeC:log(allWages)      0.049863   0.013109   3.804 0.000275 ***
metroInside:logcrimJustEff -0.041447   0.035154  -1.179 0.241844    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01066 on 81 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.6983,    Adjusted R-squared:  0.676 
F-statistic: 31.25 on 6 and 81 DF,  p-value: < 2.2e-16
newmod<-lm(crmrte ~ regcode:log(allWages) + metro*logcrimJustEff, data = dfCrime)
summary(newmod)

Call:
lm(formula = crmrte ~ regcode:log(allWages) + metro * logcrimJustEff, 
    data = dfCrime)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.031480 -0.005701 -0.000744  0.004381  0.035698 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                -0.395122   0.104840  -3.769 0.000310 ***
metroInside                -0.028355   0.042496  -0.667 0.506508    
logcrimJustEff             -0.030767   0.005214  -5.901 8.05e-08 ***
regcodeO:log(allWages)      0.050876   0.013226   3.847 0.000238 ***
regcodeW:log(allWages)      0.048818   0.013212   3.695 0.000398 ***
regcodeC:log(allWages)      0.049863   0.013109   3.804 0.000275 ***
metroInside:logcrimJustEff -0.041447   0.035154  -1.179 0.241844    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01066 on 81 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.6983,    Adjusted R-squared:  0.676 
F-statistic: 31.25 on 6 and 81 DF,  p-value: < 2.2e-16

crmrte + prbarr + prbconv + prbpris + avgsen + polpc + density + taxpc + pctmin80 + wcon + wtuc + wtrd + wfir + wser + wmfg + wfed + wsta + wloc + mix + pctymle

#Other
options(repr.plot.width=10, repr.plot.height=10)
myData<-dfCrime %>% filter(region==0)
myData<-myData[, c("crmrte", "prbarr", "prbconv", "prbpris", "avgsen", "polpc", "density", "taxpc",
           "pctmin80", "wcon", "wtuc", "wtrd", "wfir", "wser", "wmfg", "wfed", "wsta", "wloc",
           "mix", "pctymle")]
myData %>% correlate() %>% network_plot()

Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'

ggcorr(myData, nbreaks=8, palette='RdBu', label=TRUE, label_size=5, label_color='white')

#Western
myData<-dfCrime %>% filter(region==1)
myData<-myData[, c("crmrte", "prbarr", "prbconv", "prbpris", "avgsen", "polpc", "density", "taxpc",
           "pctmin80", "wcon", "wtuc", "wtrd", "wfir", "wser", "wmfg", "wfed", "wsta", "wloc",
           "mix", "pctymle")]
myData %>% correlate() %>% network_plot()

Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'

ggcorr(myData, nbreaks=8, palette='RdBu', label=TRUE, label_size=5, label_color='white')

#Central
myData<-dfCrime %>% filter(region==2)
myData<-myData[, c("crmrte", "prbarr", "prbconv", "prbpris", "avgsen", "polpc", "density", "taxpc",
           "pctmin80", "wcon", "wtuc", "wtrd", "wfir", "wser", "wmfg", "wfed", "wsta", "wloc",
           "mix", "pctymle")]
myData %>% correlate() %>% network_plot()

Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'

ggcorr(myData, nbreaks=8, palette='RdBu', label=TRUE, label_size=5, label_color='white')

#metro = urban
myData<-dfCrime %>% filter(urban==1)
myData<-myData[, c("crmrte", "prbarr", "prbconv", "prbpris", "avgsen", "polpc", "density", "taxpc",
           "pctmin80", "wcon", "wtuc", "wtrd", "wfir", "wser", "wmfg", "wfed", "wsta", "wloc",
           "mix", "pctymle")]
myData %>% correlate() %>% network_plot()

Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'

ggcorr(myData, nbreaks=8, palette='RdBu', label=TRUE, label_size=5, label_color='white')

#metro = non-urban
myData<-dfCrime %>% filter(urban==0)
myData<-myData[, c("crmrte", "prbarr", "prbconv", "prbpris", "avgsen", "polpc", "density", "taxpc",
           "pctmin80", "wcon", "wtuc", "wtrd", "wfir", "wser", "wmfg", "wfed", "wsta", "wloc",
           "mix", "pctymle")]
myData %>% correlate() %>% network_plot()

Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'

ggcorr(myData, nbreaks=8, palette='RdBu', label=TRUE, label_size=5, label_color='white')

#metro = urban
myData<-dfCrime %>% filter(urban==1)
myData<-myData[, c("crmrte", "prbarr", "prbconv", "prbpris", "avgsen", "polpc", "density", "taxpc",
           "pctmin80", "wcon", "wtuc", "wtrd", "wfir", "wser", "wmfg", "wfed", "wsta", "wloc",
           "mix", "pctymle")]
#More correlation charts - 

chart.Correlation(myData, histogram=TRUE, pch=19)

pairs.panels(myData, scale=TRUE)

ggpairs(myData)

ggcorrplot(cor(myData), p.mat = cor_pmat(myData), hc.order=TRUE, type='lower')

corrplot(cor(myData),method='circle', type = 'lower')

myData<-dfCrime[, c("county", "urban", "nonurban", "west", "central", "other", "density", "crmrte", "prbarr", "prbconv", "prbpris", "avgsen", "polpc",  "taxpc",
           "pctmin80", "wcon", "wtuc", "wtrd", "wfir", "wser", "wmfg", "wfed", "wsta", "wloc",
           "mix", "pctymle")]
myData %>% arrange(desc(density)) 
   county urban nonurban west central other   density    crmrte   prbarr
1     119     1        0    0     1.0     0 8.8276520 0.0989659 0.149094
2      67     1        0    0     1.0     0 6.4271846 0.0614177 0.217215
3     129     1        0    0     0.0     1 6.2864866 0.0834982 0.236601
4      63     1        0    0     1.0     0 5.6744967 0.0706599 0.133225
5      81     1        0    0     1.0     0 5.1244240 0.0604498 0.300215
6      71     1        0    0     0.5     0 4.8347340 0.0544061 0.243119
7     183     1        0    0     1.0     0 4.3887587 0.0568423 0.204216
8      51     1        0    0     0.0     1 3.9345510 0.0883849 0.155248
9      35     0        1    0     1.0     0 2.9242425 0.0408569 0.266026
10     21     1        0    1     0.0     0 2.6024280 0.0437355 0.234760
11     25     0        1    0     1.0     0 2.5741758 0.0302542 0.323548
12      1     0        1    0     1.0     0 2.4226327 0.0356036 0.298270
13     57     0        1    0     1.0     0 2.2518249 0.0300216 0.222002
14    135     0        1    0     1.0     0 2.1575000 0.0628972 0.092770
15    159     0        1    0     1.0     0 2.0192678 0.0362330 0.243590
16     45     0        1    0     1.0     0 1.8440171 0.0362807 0.202627
17     89     0        1    1     0.0     0 1.8155080 0.0283836 0.296646
18    191     0        1    0     0.0     1 1.7725632 0.0458895 0.172257
19    195     0        1    0     0.0     1 1.7459893 0.0313973 0.201397
20    133     0        1    0     0.0     1 1.6500655 0.0551287 0.266960
21    105     0        1    0     1.0     0 1.5984555 0.0514152 0.381400
22    109     0        1    0     1.0     0 1.5939597 0.0230995 0.162769
23    181     0        1    0     1.0     0 1.5702811 0.0729479 0.182590
24     97     0        1    0     1.0     0 1.5662020 0.0334506 0.302231
25    147     0        1    0     0.0     1 1.5159818 0.0551686 0.221542
26     23     0        1    1     0.0     0 1.5119047 0.0269836 0.289121
27    157     0        1    0     1.0     0 1.5096661 0.0305908 0.278287
28    173     0        1    1     0.0     0 1.5054216 0.0139937 0.530435
29    107     0        1    0     0.0     1 1.5024875 0.0497552 0.212895
30     27     0        1    1     0.0     0 1.4989384 0.0382489 0.268018
31    127     0        1    0     0.0     1 1.3388889 0.0291496 0.179616
32    139     0        1    0     0.0     1 1.3333334 0.0243470 0.522696
33    179     0        1    0     1.0     0 1.2816901 0.0318720 0.377543
34    167     0        1    0     1.0     0 1.2752526 0.0238285 0.362270
35    151     0        1    0     1.0     0 1.2737643 0.0264557 0.299198
36     65     0        1    0     0.0     1 1.1679842 0.0658801 0.287330
37    171     0        1    1     0.0     0 1.1521336 0.0243954 0.175649
38     49     0        1    0     0.0     1 1.1440799 0.0374979 0.264420
39    155     0        1    0     0.0     1 1.1296101 0.0312279 0.408200
40    189     0        1    1     0.0     0 1.1019108 0.0313130 0.161381
41     85     0        1    0     0.0     1 1.0815308 0.0490712 0.146132
42    165     0        1    0     0.0     1 1.0783699 0.0508341 0.340679
43      3     0        1    0     1.0     0 1.0463320 0.0152532 0.132029
44     59     0        1    0     1.0     0 1.0262172 0.0233327 0.227753
45    161     0        1    1     0.0     0 1.0052817 0.0200070 0.482425
46    101     0        1    0     0.0     1 0.9962264 0.0409403 0.149936
47    153     0        1    0     1.0     0 0.9622641 0.0317563 0.345368
48    197     0        1    1     0.0     0 0.8898810 0.0141928 0.207595
49     87     0        1    1     0.0     0 0.8648649 0.0286781 0.215108
50    111     0        1    1     0.0     0 0.8306636 0.0183048 0.202112
51    125     0        1    0     1.0     0 0.8202568 0.0266287 0.269043
52    193     0        1    1     0.0     0 0.8138298 0.0235277 0.266055
53    169     0        1    0     1.0     0 0.8008850 0.0121033 0.343387
54     83     0        1    0     0.0     1 0.7817680 0.0315752 0.456394
55    145     0        1    0     1.0     0 0.7788945 0.0299856 0.354733
56     41     0        1    0     0.0     1 0.7417582 0.0257713 0.307246
57     77     0        1    0     1.0     0 0.7172285 0.0441957 0.190876
58     69     0        1    0     1.0     0 0.7125506 0.0173158 0.283505
59    175     0        1    1     0.0     0 0.6878307 0.0164932 0.350348
60     91     0        1    0     0.0     1 0.6713483 0.0384263 0.343074
61     79     0        1    0     0.0     1 0.6203008 0.0156759 0.411538
62     11     0        1    1     0.0     0 0.6113361 0.0146067 0.524664
63     93     0        1    0     0.0     1 0.6112532 0.0362222 0.338902
64    149     0        1    1     0.0     0 0.6092437 0.0164987 0.271967
65    117     0        1    0     0.0     1 0.5813449 0.0268723 0.370474
66     19     0        1    0     0.0     1 0.5767442 0.0221567 0.162860
67     47     0        1    0     0.0     1 0.5639659 0.0313623 0.182927
68     99     0        1    1     0.0     0 0.5478615 0.0171865 0.153846
69      9     0        1    1     0.0     0 0.5469484 0.0106232 0.518219
70    163     0        1    0     0.0     1 0.5353749 0.0215728 0.310987
71     53     0        1    0     0.0     1 0.5351562 0.0140655 0.303191
72     33     0        1    0     1.0     0 0.5257009 0.0159189 0.270950
73     13     0        1    0     0.0     1 0.5169492 0.0296409 0.365004
74     37     0        1    0     1.0     0 0.5127119 0.0226017 0.321867
75     55     0        1    0     0.0     1 0.5115089 0.0790163 0.224628
76     61     0        1    0     0.0     1 0.5079365 0.0233677 0.398119
77    123     0        1    0     1.0     0 0.4938776 0.0300184 0.487430
78      7     0        1    0     1.0     0 0.4915572 0.0267532 0.364760
79     39     0        1    1     0.0     0 0.4623894 0.0119154 0.308333
80    113     0        1    1     0.0     0 0.4487427 0.0142071 0.179878
81    187     0        1    0     0.0     1 0.4427711 0.0345231 0.332669
82    143     0        1    0     0.0     1 0.4349594 0.0265806 0.317857
83      5     0        1    1     0.0     0 0.4127659 0.0129603 0.444444
84    131     0        1    0     0.0     1 0.4126394 0.0189848 0.689024
85    185     0        1    0     1.0     0 0.3887588 0.0108703 0.195266
86     17     0        1    0     0.0     1 0.3503982 0.0304289 0.251599
87    137     0        1    0     0.0     1 0.3167155 0.0126662 0.207143
88     15     0        1    0     0.0     1 0.3009986 0.0202814 0.392111
89    141     0        1    0     0.0     1 0.3005714 0.0314610 0.238636
     prbconv  prbpris avgsen      polpc    taxpc  pctmin80     wcon
1  0.3478000 0.486183   7.13 0.00223135 75.67243 0.2854600 436.7666
2  0.2482760 0.488426  10.57 0.00217747 45.89987 0.2565460 206.5527
3  0.3934130 0.415158   9.57 0.00255849 67.67963 0.2304410 315.5760
4  0.4592160 0.363636  11.51 0.00237609 50.19918 0.3822300 349.3267
5  0.2037250 0.431020  14.42 0.00243562 44.21059 0.2639410 404.8150
6  0.2295900 0.379175  11.29 0.00207028 31.53658 0.1331500 291.4508
7  0.3819080 0.367347  12.15 0.00212751 48.76492 0.2344360 360.9549
8  0.2598330 0.407628  11.93 0.00190802 35.69936 0.3777920 283.6695
9  0.3253010 0.370370  10.06 0.00189444 56.86211 0.1008380 346.5888
10 0.3347010 0.429072  10.62 0.00182958 52.62629 0.0962444 313.4738
11 0.4067800 0.492647   8.01 0.00197140 33.03621 0.1509980 315.7290
12 0.5275960 0.436170   6.71 0.00182786 30.99368 0.2021870 281.4259
13 0.7369090 0.320624  10.47 0.00136191 28.59199 0.1099570 324.6088
14 0.4777330 0.385593  11.92 0.00233871 35.99248 0.1952630 316.0858
15 0.4929400 0.476563   8.64 0.00158619 27.76489 0.1699130 334.1035
16 0.4505670 0.474820   8.96 0.00121531 30.84900 0.2174990 318.3808
17 0.3869260 0.415525   5.51 0.00126447 32.48096 0.0431205 293.8034
18 0.4500000 0.421053   9.59 0.00122733 32.74533 0.3442830 318.0599
19 1.6705199 0.470588  13.02 0.00445923 53.66693 0.3743110 315.1641
20 0.2719470 0.334951   8.99 0.00154457 27.46926 0.2638140 264.0406
21 0.3842360 0.381410   8.81 0.00195614 67.84798 0.2354980 314.0595
22 0.7816090 0.411765   9.12 0.00108043 27.32160 0.0968564 270.0951
23 0.3430230 0.548023   7.06 0.00172948 27.59179 0.4462830 244.8362
24 0.5950780 0.409774   6.62 0.00152665 29.97040 0.1840120 333.0927
25 0.4267780 0.443137   7.73 0.00218874 36.18621 0.3564110 295.1822
26 0.4037800 0.365957   7.07 0.00146111 29.08280 0.0793198 284.9890
27 0.4436810 0.377709   7.48 0.00191777 39.16965 0.2160360 317.5345
28 0.3278690 0.150000   6.64 0.00316379 37.72702 0.2539140 231.6960
29 0.3643530 0.450216   8.47 0.00168747 53.17796 0.3904780 279.4214
30 0.3529410 0.321429  11.69 0.00145013 43.06339 0.0645795 292.7350
31 1.3581400 0.335616  15.99 0.00158289 32.02376 0.3427990 290.9091
32 0.2894740 0.345455  14.22 0.00167448 29.49915 0.3806130 278.0824
33 0.3286640 0.426230   9.90 0.00147820 38.44067 0.1788580 314.4250
34 0.3225810 0.371429  10.48 0.00155144 35.09686 0.1257910 293.5538
35 0.3601530 0.340426  12.57 0.00132430 25.69287 0.0697606 340.4792
36 0.1544520 0.403922   9.84 0.00185739 30.62824 0.5169320 362.1527
37 0.9090910 0.458333   8.67 0.00157442 31.15306 0.0564349 385.3424
38 0.3718790 0.356890   8.70 0.00148532 39.23048 0.2990720 292.8322
39 0.5598230 0.386544   8.71 0.00145925 31.37446 0.6144970 253.2447
40 0.3005780 0.288462  12.27 0.00227837 31.33022 0.0198320 238.3758
41 0.5490200 0.428571   8.78 0.00143730 27.16143 0.2562870 245.8896
42 0.4685310 0.432836   7.42 0.00151382 38.74739 0.4242560 309.1764
43 1.4814800 0.450000   6.35 0.00074588 26.89208 0.0791632 255.1020
44 0.6225170 0.425532   6.50 0.00119655 41.07194 0.1087040 280.8989
45 0.5081970 0.451613   7.98 0.00124824 31.34530 0.1303660 275.4970
46 0.5714290 0.473881   9.65 0.00140045 34.87021 0.2094790 245.6298
47 0.5207100 0.458333  11.33 0.00138447 37.22833 0.2852840 262.6642
48 1.1829300 0.360825  12.23 0.00118573 25.95258 0.0546081 314.1660
49 0.5484950 0.341463  11.11 0.00169180 32.82694 0.0269921 250.7271
50 0.5223880 0.542857  11.06 0.00118719 29.58239 0.0552725 245.1514
51 0.4389610 0.396450   7.36 0.00200971 48.15414 0.2258190 261.5648
52 0.5888590 0.423423   5.86 0.00117887 28.51783 0.0593109 285.8289
53 0.7229730 0.448598  12.36 0.00109520 37.70785 0.0767092 345.6391
54 0.4572100 0.410256   7.85 0.00138532 41.08650 0.5056250 269.1710
55 0.3404910 0.594595   8.47 0.00137040 44.64758 0.3295160 305.3435
56 0.4528300 0.520833  17.41 0.00149399 41.76929 0.4264210 256.4102
57 0.5283020 0.488095   9.60 0.00246711 29.70588 0.4545130 254.7925
58 0.7393940 0.418033   9.10 0.00107108 35.37642 0.4232240 372.1622
59 0.4105960 0.387097   7.31 0.00164549 46.41461 0.0603408 253.3395
60 0.5899050 0.454545   8.79 0.00162189 31.85298 0.5680850 289.2227
61 0.3084110 0.454545   6.19 0.00102496 37.50189 0.4546750 223.9199
62 0.0683761 0.500000  13.00 0.00288203 35.22974 0.0154070 250.4006
63 0.5739440 0.484663   5.45 0.00142641 27.47967 0.5808270 231.1391
64 1.0153800 0.227273  14.62 0.00151871 29.03402 0.1000460 223.6136
65 0.7932330 0.236967  11.83 0.00119765 38.81493 0.4542550 242.3077
66 1.2256100 0.333333  10.34 0.00202425 61.15251 0.2431170 260.1381
67 0.7633330 0.270742   7.79 0.00128127 32.66050 0.3340320 367.8286
68 1.2343800 0.556962  14.75 0.00185912 39.57348 0.1428460 259.7841
69 0.4765630 0.442623   8.22 0.00086018 28.05474 0.0179619 292.3077
70 0.4011980 0.455224  11.25 0.00136587 34.01291 0.3634950 217.1781
71 0.1403510 0.250000  11.96 0.00112225 50.38139 0.1790960 266.4504
72 0.5154640 0.480000   7.32 0.00075593 27.38110 0.4391690 218.8868
73 0.5206070 0.420833  10.55 0.00133771 30.69649 0.3217940 238.3064
74 0.3854960 0.316832   8.69 0.00130501 34.70248 0.2780790 307.2780
75 0.2078310 0.304348  13.57 0.00400962 34.48254 0.0649622 309.5238
76 0.4934380 0.361702   8.77 0.00141622 32.59961 0.3499950 244.2002
77 0.2263610 0.443038   6.49 0.00176086 38.45734 0.2596020 345.1677
78 0.5254240 0.435484   7.14 0.00152994 42.94759 0.4791610 375.2345
79 0.9729730 0.291667  11.58 0.00119154 27.27564 0.0394549 277.5575
80 0.2203390 0.461538   6.39 0.00151600 40.80142 0.0239865 244.7552
81 0.4431140 0.432432   6.98 0.00116911 34.71814 0.4421320 264.4231
82 0.3146070 0.250000   9.36 0.00085438 31.22779 0.3863590 250.8361
83 0.2678570 0.600000   6.76 0.00123431 34.81605 0.0316053 226.9470
84 0.4955750 0.401786   9.97 0.00121549 37.70006 0.6194210 225.8647
85 2.1212101 0.442857   5.38 0.00122210 40.82454 0.6434820 226.8245
86 0.4364410 0.436893   7.32 0.00129761 34.96204 0.4038900 193.6432
87 1.0689700 0.322581   6.18 0.00081426 44.29367 0.3304480 299.4956
88 0.7692310 0.507692  10.64 0.00103525 34.00304 0.6105400 253.5926
89 0.4126980 0.487179  13.18 0.00127115 35.97390 0.3985150 257.9737
       wtuc     wtrd     wfir     wser   wmfg   wfed   wsta   wloc
1  548.3239 354.6761 509.4655 354.3007 494.30 568.40 329.22 379.77
2  379.5547 189.1807 278.0352 230.4981 275.72 419.07 400.59 313.55
3  392.0999 220.4530 363.2880 292.7027 464.49 548.49 421.36 319.08
4  548.9865 238.9154 435.1107 391.3081 646.85 563.77 415.51 362.58
5  489.3144 308.5762 420.8864 305.1543 448.86 563.72 426.47 333.64
6  595.3719 240.3673 348.0254 295.2301 358.95 509.43 359.11 339.58
7  528.5593 306.0835 430.0697 348.2754 444.45 597.95 453.08 362.99
8  412.4720 213.7524 324.8357 257.3344 441.72 433.94 367.34 333.71
9  469.2220 277.2925 359.3117 296.8491 341.32 525.51 360.68 333.32
10 433.8580 228.1740 363.7671 318.3635 378.90 496.13 381.30 335.36
11 384.6154 220.5897 358.6328 318.0335 355.59 486.36 411.08 357.44
12 408.7245 221.2701 453.1722 274.1775 334.54 477.58 292.09 311.91
13 418.6380 225.5296 338.5699 261.3512 324.67 496.07 325.15 314.01
14 420.8830 179.1289 389.8522 292.2253 388.75 509.95 499.59 333.05
15 475.3228 260.2710 329.5464 265.4315 374.41 491.16 346.81 351.74
16 403.0558 248.7759 301.8632 293.1148 367.42 463.37 352.35 320.82
17 501.9174 214.2170 334.1840 258.3238 443.70 496.28 308.01 305.71
18 400.8570 230.9888 320.0345 238.4958 295.26 334.55 375.45 327.62
19 377.9356 246.0614 411.4330 296.8684 392.27 480.79 303.11 337.28
20 318.9644 183.2609 265.1232 230.6581 258.25 326.10 329.43 301.64
21 401.1326 237.8523 323.9486 274.8686 352.08 463.11 267.78 343.13
22 374.0296 226.4881 335.4204 230.3086 320.20 453.61 389.34 302.93
23 365.4716 279.2273 325.0271 213.5822 290.69 453.53 317.23 286.45
24 421.6986 222.9622 338.7534 282.9701 346.51 480.14 351.17 330.93
25 379.8962 205.4827 377.6978 274.6765 390.01 543.21 464.63 330.78
26 400.7398 213.1840 325.0271 315.0242 327.45 463.67 361.21 308.32
27 409.7771 192.4181 302.6162 245.7938 418.48 478.48 342.13 318.07
28 213.6752 175.1604 267.0940 204.3792 193.01 334.44 414.68 304.32
29 350.1326 202.9344 333.9183 276.2629 350.63 496.75 332.12 324.97
30 428.5023 213.4620 316.9436 292.3517 309.27 450.28 283.60 310.85
31 426.3901 257.6008 441.1413 305.7612 329.87 508.61 380.30 329.71
32 441.5954 201.3569 302.6724 260.5459 264.25 452.59 348.05 285.47
33 361.1018 241.0034 342.6819 270.4866 349.63 459.32 387.16 376.45
34 419.0362 221.0995 326.8283 253.0096 351.09 463.69 312.53 311.47
35 415.1317 218.7198 322.4150 278.1124 294.37 474.26 298.66 294.72
36 540.1061 209.0579 316.2955 216.4589 313.71 543.03 348.88 329.16
37 412.5212 224.7224 339.9547 253.6207 288.32 452.53 341.77 324.06
38 406.5041 212.4851 312.6954 304.3066 375.32 474.30 297.69 290.48
39 407.0929 200.2213 337.9095 262.9849 278.82 441.49 381.46 304.73
40 354.2510 180.9359 369.4332 253.2281 304.72 427.84 451.79 297.19
41 448.7180 198.6254 248.1390 264.6729 310.86 437.03 397.54 327.37
42 471.3424 203.0162 294.4124 267.6852 374.25 442.38 381.14 285.50
43 376.2542 196.0101 258.5650 192.3077 300.38 409.83 362.96 301.47
44 335.4590 210.3365 317.3077 316.6645 316.57 428.33 388.92 307.25
45 402.5045 201.5675 279.2492 251.4202 334.92 450.28 277.60 302.83
46 378.1590 203.1547 307.2585 232.1825 325.03 459.90 350.27 317.12
47 294.6650 192.8994 267.3797 237.1590 301.29 467.08 350.24 302.25
48 341.8803 182.8020 348.1432 212.8205 322.92 391.72 385.65 306.85
49 437.9284 196.5065 288.4615 243.4706 588.99 488.76 346.32 294.21
50 571.5438 179.5039 327.9666 251.4270 260.35 384.15 354.41 304.91
51 457.5350 199.5847 299.7388 320.1325 277.68 447.87 361.24 300.77
52 480.1948 268.3836 365.0196 295.9352 295.63 468.26 337.88 348.74
53 588.6970 190.5555 331.5650 206.6858 306.42 406.62 348.37 306.68
54 480.7692 200.3768 297.2437 239.2233 360.27 466.61 367.24 302.63
55 538.8488 200.7432 302.1978 219.6343 334.65 414.63 298.78 295.00
56 379.0005 238.5589 271.7391 232.5916 332.07 451.84 389.99 312.05
57 391.7379 197.6995 311.3553 199.4458 360.21 512.30 369.75 329.34
58 508.2035 266.7794 466.0016 347.6609 560.78 516.05 381.03 388.09
59 402.5045 172.6453 317.3394 257.8734 627.02 344.06 357.69 298.33
60 431.9210 198.0064 253.3395 266.4674 316.19 430.64 314.75 297.34
61 320.5128 171.0329 334.4482 250.0000 218.30 353.31 385.19 323.08
62 401.3378 187.8255 258.5650 237.1507 258.60 391.48 325.71 275.22
63 187.6173 161.3771 254.5249 182.0196 298.78 471.09 400.11 292.98
64 437.0629 188.7683 353.2182 210.4415 289.43 421.34 342.92 301.23
65 424.7286 167.0511 282.4099 229.0151 327.29 383.88 360.66 302.03
66 613.2261 191.2452 290.5141 266.0934 567.06 403.15 258.33 299.44
67 342.5724 194.6777 352.6135 256.4102 355.83 426.56 313.71 313.84
68 417.2099 168.2692 301.5734 247.6291 258.99 442.76 387.02 291.44
69 377.3126 206.8215 289.3125 215.1933 290.89 377.35 367.23 342.82
70 415.2824 196.5514 279.0347 238.6300 295.07 429.27 318.02 299.92
71 202.4292 219.7802 305.9441 223.8502 250.42 371.79 383.72 296.64
72 286.4157 195.1995 368.2488 172.4733 324.45 357.16 407.54 268.44
73 366.3004 205.5358 310.1737 259.3391 303.42 449.84 350.72 283.76
74 462.4408 227.7858 305.6546 231.3615 321.90 460.62 393.29 321.53
75 445.2762 189.7436 284.5933 221.3903 319.21 338.91 361.68 326.08
76 308.5150 214.2203 327.1213 203.8864 266.58 414.84 333.46 308.05
77 396.2704 193.0723 272.2941 242.4605 277.34 345.09 328.00 325.77
78 397.6901 191.1720 281.0651 256.7214 281.80 412.15 328.27 299.03
79 390.1895 180.0634 295.8580 246.0152 270.78 397.33 313.06 239.17
80 412.0879 154.2090 256.4102 265.1301 291.10 337.09 374.11 246.65
81 421.3483 170.5293 282.0513 183.1502 297.14 390.94 356.91 267.08
82 373.9316 178.3723 234.5216 133.0431 157.41 380.97 388.43 253.66
83 372.2084 229.3209 305.9441 209.6972 237.65 358.98 331.53 281.37
84 375.2345 220.9747 307.6923 172.6281 278.70 432.81 370.81 259.78
85 331.5650 167.3726 264.4231 266.7583 247.72 381.33 367.25 300.13
86 346.6011 202.9595 268.3363 208.2520 339.76 389.51 322.06 278.39
87 356.1254 170.8711 170.9402 250.8361 192.96 360.84 283.90 321.73
88 353.2182 199.2377 356.1254 206.2816 235.05 416.49 370.62 297.13
89 388.3136 179.7050 263.7363 196.1453 255.56 374.78 380.29 294.98
          mix    pctymle
1  0.16869897 0.07916495
2  0.21999694 0.07647973
3  0.07871422 0.08109921
4  0.07585382 0.09468981
5  0.10255422 0.08310476
6  0.10186080 0.07939028
7  0.08527010 0.09935585
8  0.10474319 0.14223780
9  0.18122160 0.07891140
10 0.06289308 0.07219726
11 0.08848315 0.07641540
12 0.08016878 0.07787097
13 0.07497714 0.07873024
14 0.05091770 0.13302912
15 0.09146758 0.07705218
16 0.08594864 0.08013537
17 0.09090909 0.06795343
18 0.08616445 0.08828809
19 0.15612382 0.07945071
20 0.12176319 0.24871162
21 0.10368066 0.07365920
22 0.05424063 0.08050946
23 0.10003893 0.07977433
24 0.08431085 0.07655433
25 0.09345226 0.11421655
26 0.08400646 0.08397806
27 0.12467756 0.07771273
28 0.41975310 0.07462687
29 0.11912815 0.07455523
30 0.08957055 0.08353864
31 0.06305506 0.07400288
32 0.34879407 0.09625563
33 0.13481072 0.08703093
34 0.10925926 0.07687439
35 0.07342084 0.07763254
36 0.09364294 0.07622346
37 0.08207343 0.07735254
38 0.10692308 0.12224479
39 0.12516960 0.08183564
40 0.05719921 0.15092644
41 0.14886613 0.10694169
42 0.09595300 0.08548180
43 0.03022670 0.08260694
44 0.11616162 0.07756928
45 0.11024390 0.07362188
46 0.09447166 0.07609449
47 0.16323297 0.07570874
48 0.06756757 0.07419893
49 0.09794629 0.07600891
50 0.06763285 0.06861899
51 0.07110778 0.07415335
52 0.11050157 0.07819394
53 0.13720317 0.08280677
54 0.09106830 0.07521905
55 0.22533333 0.07275662
56 0.09872611 0.06355526
57 0.17905165 0.08345764
58 0.07977737 0.08181948
59 0.10230179 0.08436658
60 0.09869203 0.08729226
61 0.28078818 0.08120255
62 0.31952664 0.09891920
63 0.18528996 0.08633697
64 0.11682243 0.06215772
65 0.07485030 0.07632116
66 0.05334728 0.07713232
67 0.12405757 0.07381025
68 0.01960784 0.12894706
69 0.06008584 0.07069755
70 0.12108560 0.07472797
71 0.08045977 0.08476309
72 0.15112540 0.08005202
73 0.15237226 0.07073344
74 0.08100930 0.07154077
75 0.08437271 0.07613807
76 0.16707317 0.07579902
77 0.31135532 0.08119376
78 0.27362204 0.07353726
79 0.13744076 0.06973287
80 0.05128205 0.09171820
81 0.29048842 0.07794872
82 0.21212120 0.06769374
83 0.46511629 0.07211538
84 0.16725978 0.08356434
85 0.04968944 0.07008217
86 0.21818182 0.07769163
87 0.06870229 0.07098370
88 0.23495702 0.07430546
89 0.20000000 0.07572646
ggcorr(myData, nbreaks=8, palette='RdBu', label=TRUE, label_size=5, label_color='white')